Ambrosia in action. Building a simple A/B pipeline on synthetic data
In this example, we build a short but complete experimental pipeline from several parts of Ambrosia. The data is a synthetically generated week of daily content views per user.
[2]:
import pandas as pd
from ambrosia.preprocessing import AggregatePreprocessor
from ambrosia.designer import Designer
from ambrosia.splitter import Splitter
from ambrosia.tester import Tester
Load data
[3]:
dataframe = pd.read_csv('../tests/test_data/week_metrics.csv')
dataframe.head()
[3]:
| | id | gender | watched | sessions | day | platform |
|---|---|---|---|---|---|---|
| 0 | 0 | Male | 28.440846 | 4 | 1 | android |
| 1 | 1 | Female | 1.825271 | 2 | 1 | ios |
| 2 | 2 | Female | 46.995606 | 0 | 1 | web |
| 3 | 3 | Female | 37.310264 | 1 | 1 | ios |
| 4 | 4 | Female | 147.513105 | 0 | 1 | web |
Aggregate data
We would like to run a fixed-horizon A/B test on weekly user metrics, and we have one week of historical data to design the experiment with. (For a real experiment, one week of historical data would not be enough.)
First, we aggregate the metrics per user, so that each row describes one user over the whole week and the rows become independent of each other.
[4]:
transformer = AggregatePreprocessor()
[5]:
df = transformer.fit_transform(dataframe,
                               groupby_columns='id',
                               agg_params={
                                   'watched': 'sum',
                                   'sessions': 'max',
                                   'gender': 'simple',
                                   'platform': 'mode'
                               })
[6]:
df.head()
[6]:
| | id | watched | sessions | gender | platform |
|---|---|---|---|---|---|
| 0 | 0 | 772.597224 | 4 | Male | ios |
| 1 | 1 | 538.076739 | 6 | Female | android |
| 2 | 2 | 288.492353 | 7 | Female | android |
| 3 | 3 | 373.620408 | 3 | Female | ios |
| 4 | 4 | 630.238862 | 8 | Female | ios |
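To make the aggregation step concrete, here is a minimal plain-pandas sketch of a comparable per-user aggregation on a toy data frame. It is not Ambrosia's implementation. The toy values and the `most_frequent` helper are made up for illustration, and mapping `'simple'` to "take the first value" is an assumption about what that method does.

```python
import pandas as pd

# Toy daily log: two users, several days each (made-up values).
daily = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "watched": [10.0, 20.0, 5.0, 7.0, 3.0],
    "sessions": [1, 4, 2, 0, 2],
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "platform": ["ios", "web", "ios", "android", "android"],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value; ties resolve to the smallest one."""
    return values.mode().iloc[0]


# One row per user: sum the views, take the max of sessions,
# keep the first gender value (assumed meaning of 'simple'),
# and take the most frequent platform.
per_user = daily.groupby("id", as_index=False).agg(
    watched=("watched", "sum"),
    sessions=("sessions", "max"),
    gender=("gender", "first"),
    platform=("platform", most_frequent),
)
print(per_user)
```

After this step each user appears exactly once, which is what the later sample size calculation and t-test rely on.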
Design A/B test parameters
We want to detect a 5% relative effect in the watched metric with standard type I and type II statistical errors ($\alpha = 0.05$, $\beta = 0.2$). We will use the theoretical approach to calculate the parameters, and after the experiment ends we will apply the two-sample independent t-test as the statistical criterion.
[7]:
designer = Designer(dataframe=df, metrics='watched')
[8]:
designer.run('size', method='theory', effects=1.05)
[8]:
| Effect | Errors ($\alpha$, $\beta$) = (0.05; 0.2) |
|---|---|
| 5.0% | 894 |
For our experiment, about 900 objects in each experimental group are sufficient.
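For intuition about where such a number comes from, the sketch below uses the textbook normal-approximation formula for a two-sided, two-sample comparison of means, $n = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2 / \delta^2$ per group, where $\delta$ is the absolute difference to detect. This is not necessarily the exact computation Ambrosia performs. The function name `sample_size_per_group` and the mean and standard deviation are hypothetical, so the output is not expected to match the 894 above.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean, std, effect, alpha=0.05, beta=0.2):
    """Per-group size needed to detect a relative `effect` (1.05 means +5%).

    Normal approximation for a two-sided two-sample test of means,
    assuming equal variances and equal group sizes.
    """
    delta = mean * (effect - 1.0)                       # absolute difference
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(1.0 - beta)           # quantile for power 1 - beta
    n = 2.0 * (z_alpha + z_beta) ** 2 * std ** 2 / delta ** 2
    return math.ceil(n)


# Hypothetical metric: mean 100, standard deviation 50, +5% effect.
n_example = sample_size_per_group(mean=100.0, std=50.0, effect=1.05)
print(n_example)
```

The formula shows the main trade-off: halving the detectable effect roughly quadruples the required group size, while a noisier metric (larger `std`) increases it quadratically.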
Split groups
In our business scenario, we don’t need a real-time splitting system, so a batch group split is enough. We will use the same data frame as a complete database containing unique object IDs and some useful attributes.
Let’s split off groups of the calculated size, stratified by the gender and platform variables. The hash split approach is used so that the split result is deterministic.
[9]:
splitter = Splitter(dataframe=df,
                    strat_columns=['gender', 'platform'],
                    fit_columns=['sessions'])
[10]:
splitted_groups = splitter.run(groups_size=900, method='hash', salt='exp_322')
[11]:
splitted_groups
[11]:
| | id | watched | sessions | gender | platform | group |
|---|---|---|---|---|---|---|
| 1 | 1 | 538.076739 | 6 | Female | android | A |
| 6 | 6 | 516.444015 | 10 | Female | android | A |
| 10 | 10 | 678.150205 | 3 | Female | android | A |
| 31 | 31 | 638.889779 | 11 | Female | android | A |
| 49 | 49 | 441.192430 | 5 | Female | android | A |
| ... | ... | ... | ... | ... | ... | ... |
| 378 | 378 | 1217.191864 | 5 | Male | android | A |
| 258 | 258 | 1356.446101 | 3 | Female | ios | A |
| 1973 | 1973 | 662.959150 | 8 | Female | android | B |
| 4324 | 4324 | 610.512075 | 5 | Male | web | B |
| 4791 | 4791 | 607.091209 | 22 | Male | web | B |
1800 rows × 6 columns
Objects with these identifiers will fall into the corresponding groups. Let’s wait for the end of the experiment and look at the result.
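The idea behind a deterministic hash split can be illustrated with the standard library alone. The sketch below orders IDs by a salted hash and cuts consecutive slices of the requested size. It is only an illustration of the general technique: it ignores stratification, and the `hash_key` and `hash_split` helpers are hypothetical, not Ambrosia's internals.

```python
import hashlib


def hash_key(object_id, salt):
    """Salted SHA-256 digest of an ID, used only as a stable sort key."""
    return hashlib.sha256(f"{salt}{object_id}".encode("utf-8")).hexdigest()


def hash_split(ids, group_size, salt, labels=("A", "B")):
    """Order IDs by salted hash and cut consecutive slices of `group_size`."""
    ordered = sorted(ids, key=lambda object_id: hash_key(object_id, salt))
    if group_size * len(labels) > len(ordered):
        raise ValueError("Not enough objects for the requested group sizes")
    return {
        label: ordered[k * group_size:(k + 1) * group_size]
        for k, label in enumerate(labels)
    }


groups = hash_split(range(100), group_size=10, salt="exp_322")
# The same IDs and salt always give the same groups, whatever the input order.
groups_again = hash_split(reversed(range(100)), group_size=10, salt="exp_322")
print(groups == groups_again)
```

Changing the salt produces a different, independent ordering, which is why each experiment typically gets its own salt.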
Result measurement
[12]:
experiment_result = pd.read_csv('../tests/test_data/watch_result.csv')
experiment_result.head()
[12]:
| | id | watched | group | day |
|---|---|---|---|---|
| 0 | 1708 | 349.581133 | A | 1 |
| 1 | 24 | 124.224169 | A | 1 |
| 2 | 1692 | 14.812922 | A | 1 |
| 3 | 185 | 179.607284 | A | 1 |
| 4 | 205 | 349.539016 | A | 1 |
Aggregate
[13]:
transformer = AggregatePreprocessor(real_method='sum')
[14]:
df_to_test = transformer.fit_transform(dataframe=experiment_result,
                                       groupby_columns='id',
                                       real_cols='watched',
                                       categorial_cols='group')
[15]:
df_to_test
[15]:
| | id | watched | group |
|---|---|---|---|
| 0 | 6 | 597.833362 | A |
| 1 | 11 | 549.314234 | A |
| 2 | 20 | 564.401942 | A |
| 3 | 21 | 248.735358 | A |
| 4 | 23 | 926.048946 | B |
| ... | ... | ... | ... |
| 1795 | 4987 | 454.662125 | A |
| 1796 | 4988 | 404.600192 | B |
| 1797 | 4997 | 594.629770 | B |
| 1798 | 4998 | 1025.918249 | B |
| 1799 | 4999 | 737.005009 | B |
1800 rows × 3 columns
Evaluate the result and calculate the relative effect with the corresponding confidence interval (CI).
[16]:
tester = Tester(dataframe=df_to_test, metrics='watched', column_groups='group')
[17]:
tester.run(effect_type='relative', method='theory')
[17]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 0.00004 | 0.079901 | (0.0419, 0.1183) | watched | A | B |
For the chosen type I error rate, we obtained a statistically significant result, with a point estimate of the effect of about 8%.
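As a cross-check of what such a report contains, the sketch below runs a two-sample independent t-test with SciPy on synthetic data and computes a relative effect with a confidence interval. The data are randomly generated with made-up parameters, not the experiment results above. The delta-method interval for the ratio of means is one common choice and is an assumption here, not a statement of how Ambrosia builds its interval.

```python
import numpy as np
from scipy import stats

# Synthetic per-user totals with a true uplift of 8% (made-up parameters).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=600.0, scale=150.0, size=900)
group_b = rng.normal(loc=648.0, scale=150.0, size=900)

# Two-sample independent t-test (equal variances assumed by default).
t_stat, p_value = stats.ttest_ind(group_b, group_a)

# Point estimate of the relative effect: mean(B) / mean(A) - 1.
mean_a, mean_b = group_a.mean(), group_b.mean()
effect = mean_b / mean_a - 1.0

# Delta-method standard error of the ratio of two independent sample means.
var_mean_a = group_a.var(ddof=1) / group_a.size
var_mean_b = group_b.var(ddof=1) / group_b.size
std_err = np.sqrt(var_mean_b / mean_a**2 + mean_b**2 * var_mean_a / mean_a**4)

alpha = 0.05
z = stats.norm.ppf(1.0 - alpha / 2.0)
confidence_interval = (effect - z * std_err, effect + z * std_err)

print(f"p-value: {p_value:.2g}")
print(f"relative effect: {effect:.4f}")
print(f"95% CI: ({confidence_interval[0]:.4f}, {confidence_interval[1]:.4f})")
```

A result is called significant at level `alpha` when the p-value is below it, which for a relative effect corresponds to a confidence interval that excludes zero.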