Ambrosia in action. Building a simple A/B pipeline on synthetic data

In this example, we build a short but complete experimental pipeline using several parts of Ambrosia. The data is a synthetically generated week of daily content views by users.

This tutorial gives a general understanding of how to build A/B pipelines with the tools from Ambrosia.
We will not discuss the choice of hypothesis, criteria, or the logic behind certain parameter values.
[2]:
import pandas as pd

from ambrosia.preprocessing import AggregatePreprocessor
from ambrosia.designer import Designer
from ambrosia.splitter import Splitter
from ambrosia.tester import Tester

Load data

[3]:
dataframe = pd.read_csv('../tests/test_data/week_metrics.csv')
dataframe.head()
[3]:
id gender watched sessions day platform
0 0 Male 28.440846 4 1 android
1 1 Female 1.825271 2 1 ios
2 2 Female 46.995606 0 1 web
3 3 Female 37.310264 1 1 ios
4 4 Female 147.513105 0 1 web

Aggregate data

We would like to run a fixed-horizon A/B test in which we observe the weekly metrics of users. For the design of the experiment we have one week of historical data. (For real experiments, one week of historical data is not enough.)

First, we need to aggregate the metrics by user, so that each row of the dataset describes one user over the whole week and the rows become independent of each other.

[4]:
transformer = AggregatePreprocessor()
[5]:
df = transformer.fit_transform(dataframe,
                               groupby_columns='id',
                               agg_params={
                                   'watched': 'sum',
                                   'sessions': 'max',
                                   'gender': 'simple',
                                   'platform': 'mode'
                               })
[6]:
df.head()
[6]:
id watched sessions gender platform
0 0 772.597224 4 Male ios
1 1 538.076739 6 Female android
2 2 288.492353 7 Female android
3 3 373.620408 3 Female ios
4 4 630.238862 8 Female ios
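
To make the aggregation step concrete, here is a minimal plain-Python sketch of the same idea: daily rows are grouped by user, numeric metrics are reduced with a sum or a maximum, and a categorical column is reduced to its most frequent value. The sample rows and the helper `aggregate_by_user` are hypothetical and only illustrate the logic; this is not how `AggregatePreprocessor` is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical daily rows: (id, watched, sessions, platform).
daily_rows = [
    (0, 28.4, 4, "android"),
    (0, 15.1, 2, "ios"),
    (0, 40.0, 1, "ios"),
    (1, 1.8, 2, "ios"),
    (1, 3.2, 5, "ios"),
]


def aggregate_by_user(rows):
    """Collapse daily rows into one record per user id."""
    grouped = defaultdict(list)
    for user_id, watched, sessions, platform in rows:
        grouped[user_id].append((watched, sessions, platform))

    result = {}
    for user_id, records in grouped.items():
        platforms = Counter(platform for _, _, platform in records)
        result[user_id] = {
            "watched": sum(watched for watched, _, _ in records),     # like 'sum'
            "sessions": max(sessions for _, sessions, _ in records),  # like 'max'
            "platform": platforms.most_common(1)[0][0],               # like 'mode'
        }
    return result


weekly = aggregate_by_user(daily_rows)
```

After this step every user appears exactly once, which is what makes the rows usable as independent observations in the design and testing steps.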

Design A/B test parameters

Let’s design the experiment. Suppose we want to detect a 5% effect on the watched metric with the standard type I and type II error rates (0.05 and 0.2).
How many users should be in each experimental group for that scenario?

We will use the theoretical approach to calculate the parameters, and after the end of the experiment we will apply the two-sample independent t-test as the statistical criterion.

[7]:
designer = Designer(dataframe=df, metrics='watched')
[8]:
designer.run('size', method='theory', effects=1.05)
[8]:
Errors ($\alpha$, $\beta$) (0.05; 0.2)
Effect
5.0% 894

For our experiment, about 900 objects in each experimental group are sufficient.
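
As a rough cross-check of the order of magnitude, the textbook normal-approximation formula for a two-sided, two-sample comparison of means, n = 2 (z_{1-α/2} + z_{1-β})² σ² / δ², can be sketched in a few lines. The helper `sample_size_per_group` and its arguments are assumptions made for illustration; the `theory` method of `Designer` may use a different calculation (for example, one based on the t-distribution), so the numbers need not match exactly. Note that a multiplicative effect of 1.05 corresponds to `relative_effect=0.05` here.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(mean, std, relative_effect, alpha=0.05, beta=0.2):
    """Per-group sample size from the normal approximation (illustrative)."""
    delta = mean * relative_effect                    # absolute effect to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided type I quantile
    z_beta = NormalDist().inv_cdf(1 - beta)           # power quantile
    return ceil(2 * (z_alpha + z_beta) ** 2 * std ** 2 / delta ** 2)


# Hypothetical metric with mean 100 and standard deviation 50.
n_per_group = sample_size_per_group(mean=100.0, std=50.0, relative_effect=0.05)
```

Passing the sample mean and standard deviation of the aggregated watched metric would be the natural way to compare this approximation with the table above.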

Split groups

In our business scenario we don’t need a real-time splitting system, so a batch group split is enough. We will use the same data frame as a complete database containing unique object IDs and some useful data.

Let’s split groups of the calculated size, stratified by the gender and platform variables. The hash split approach is used to get a deterministic split result.

[9]:
splitter = Splitter(dataframe=df,
                    strat_columns=['gender', 'platform'],
                    fit_columns=['sessions'])
[10]:
splitted_groups = splitter.run(groups_size=900, method='hash', salt='exp_322')
[11]:
splitted_groups
[11]:
id watched sessions gender platform group
1 1 538.076739 6 Female android A
6 6 516.444015 10 Female android A
10 10 678.150205 3 Female android A
31 31 638.889779 11 Female android A
49 49 441.192430 5 Female android A
... ... ... ... ... ... ...
378 378 1217.191864 5 Male android A
258 258 1356.446101 3 Female ios A
1973 1973 662.959150 8 Female android B
4324 4324 610.512075 5 Male web B
4791 4791 607.091209 22 Male web B

1800 rows × 6 columns

Objects with these identifiers will fall into the corresponding groups. Let’s wait for the end of the experiment and look at the result.
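
The reason a hash-based split is deterministic can be shown with a small sketch: the group of an object depends only on its identifier and the salt, so repeating the computation always gives the same answer, while a new salt gives a new, unrelated split. The function `hash_group` is hypothetical and uses SHA-256 purely for illustration; the hashing inside `Splitter`, and the way it handles group sizes and stratification, may differ.

```python
import hashlib


def hash_group(object_id, salt, groups=("A", "B")):
    """Map an object id to a group using a salted hash (illustrative)."""
    key = f"{salt}:{object_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the digest as an integer and reduce it to a group index.
    return groups[int(digest, 16) % len(groups)]


# The same salt always reproduces the same assignment.
assignment = {object_id: hash_group(object_id, salt="exp_322")
              for object_id in range(1000)}
repeated = {object_id: hash_group(object_id, salt="exp_322")
            for object_id in range(1000)}
```

Because no random state is involved, the assignment can be recomputed at any time from the identifier and the salt alone, without storing the split.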

Result measurement

The experiment has ended, and we have received data on daily metrics in both groups for a week.
Let’s aggregate the data to weekly values and check for statistically significant changes.
[12]:
experiment_result = pd.read_csv('../tests/test_data/watch_result.csv')
experiment_result.head()
[12]:
id watched group day
0 1708 349.581133 A 1
1 24 124.224169 A 1
2 1692 14.812922 A 1
3 185 179.607284 A 1
4 205 349.539016 A 1

Aggregate

[13]:
transformer = AggregatePreprocessor(real_method='sum')
[14]:
df_to_test = transformer.fit_transform(dataframe=experiment_result,
                                       groupby_columns='id',
                                       real_cols='watched',
                                       categorial_cols='group')
[15]:
df_to_test
[15]:
id watched group
0 6 597.833362 A
1 11 549.314234 A
2 20 564.401942 A
3 21 248.735358 A
4 23 926.048946 B
... ... ... ...
1795 4987 454.662125 A
1796 4988 404.600192 B
1797 4997 594.629770 B
1798 4998 1025.918249 B
1799 4999 737.005009 B

1800 rows × 3 columns

Evaluate the result and calculate the relative effect with the corresponding confidence interval (CI)

[16]:
tester = Tester(dataframe=df_to_test, metrics='watched', column_groups='group')
[17]:
tester.run(effect_type='relative', method='theory')
[17]:
   first_type_error   pvalue    effect   confidence_interval   metric name   group A label   group B label
0              0.05  0.00004  0.079901      (0.0419, 0.1183)       watched               A               B

For the chosen type I error rate we obtained a statistically significant result, with a point estimate of the effect of about 8%.
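
To show what goes into such a result, here is a minimal sketch of a relative-effect estimate with a p-value and a confidence interval. It uses a large-sample normal approximation instead of the t-distribution and the delta method for the variance of the ratio of two independent means. The function `relative_effect_test` and the sample data are hypothetical; `Tester` may compute these quantities differently.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def relative_effect_test(group_a, group_b, alpha=0.05):
    """Return (pvalue, relative effect, confidence interval), illustrative."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Variances of the two sample means.
    var_a = variance(group_a) / len(group_a)
    var_b = variance(group_b) / len(group_b)

    # Two-sided p-value for the difference of means (normal approximation).
    z = (mean_b - mean_a) / sqrt(var_a + var_b)
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))

    # Relative effect and its standard error via the delta method.
    effect = mean_b / mean_a - 1
    se = sqrt(var_b / mean_a ** 2 + mean_b ** 2 * var_a / mean_a ** 4)
    quantile = NormalDist().inv_cdf(1 - alpha / 2)
    return pvalue, effect, (effect - quantile * se, effect + quantile * se)


# Hypothetical samples: group B is shifted upwards by one unit.
sample_a = [10.0, 11.0, 9.0, 10.0] * 25
sample_b = [11.0, 12.0, 10.0, 11.0] * 25
pvalue, effect, interval = relative_effect_test(sample_a, sample_b)
```

A result is read the same way as the table above: the effect is statistically significant at level alpha when the p-value is below alpha, which for this approximation goes together with a confidence interval that excludes zero.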