Ambrosia in action. Building a simple A/B pipeline on synthetic data

In this example, we build a short but complete experimental pipeline using several parts of Ambrosia. The data is a synthetically generated one-week sample of daily content views by users.

The tutorial is intended to give a general understanding of how A/B pipelines are built and how the Ambrosia tools are used.

We will not discuss the choice of hypothesis, criteria, or the logic behind certain parameter values.

[2]:

import pandas as pd

from ambrosia.preprocessing import AggregatePreprocessor
from ambrosia.designer import Designer
from ambrosia.splitter import Splitter
from ambrosia.tester import Tester

Load data

[3]:

dataframe = pd.read_csv('../tests/test_data/week_metrics.csv')
dataframe.head()

[3]:

	id	gender	watched	sessions	day	platform
0	0	Male	28.440846	4	1	android
1	1	Female	1.825271	2	1	ios
2	2	Female	46.995606	0	1	web
3	3	Female	37.310264	1	1	ios
4	4	Female	147.513105	0	1	web

Aggregate data

We would like to run a fixed-horizon A/B test in which we observe users' weekly metrics, and for the experiment design we have one week of historical data. (In a real experiment, one week of historical data would not be enough for this.)

First, we need to aggregate the metrics by user, so that each row holds one user's weekly values and the rows become independent of each other.

[4]:

transformer = AggregatePreprocessor()

[5]:

df = transformer.fit_transform(dataframe,
                               groupby_columns='id',
                               agg_params={
                                   'watched': 'sum',
                                   'sessions': 'max',
                                   'gender': 'simple',
                                   'platform': 'mode'
                               })

[6]:

df.head()

[6]:

	id	watched	sessions	gender	platform
0	0	772.597224	4	Male	ios
1	1	538.076739	6	Female	android
2	2	288.492353	7	Female	android
3	3	373.620408	3	Female	ios
4	4	630.238862	8	Female	ios
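
For intuition, the same kind of per-user aggregation can be sketched with plain pandas. This is only an illustration on a tiny hypothetical data frame, not a description of how `AggregatePreprocessor` works internally; the `most_frequent` helper and the toy values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical daily records: two users, several days each.
daily = pd.DataFrame({
    "id": [0, 0, 1, 1, 1],
    "watched": [10.0, 5.0, 1.0, 2.0, 3.0],
    "sessions": [2, 4, 1, 0, 3],
    "platform": ["ios", "ios", "web", "android", "web"],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value of a categorical column."""
    return values.mode().iloc[0]


# One row per user: sum of watched, max of sessions, most frequent platform.
weekly = daily.groupby("id", as_index=False).agg(
    watched=("watched", "sum"),
    sessions=("sessions", "max"),
    platform=("platform", most_frequent),
)
print(weekly)
```

After this step each user appears exactly once, which is what makes the rows usable as independent observations in the design and testing stages.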

Design A/B test parameters

Let’s design the experiment. Suppose we want to detect a 5% effect on the watched metric with standard type I and type II error rates.

How many users should be in each experimental group for that scenario?

We will use the theoretical approach to calculate the parameters, and once the experiment is over we will apply the two-sample independent t-test as the statistical criterion.

[7]:

designer = Designer(dataframe=df, metrics='watched')

[8]:

designer.run('size', method='theory', effects=1.05)

[8]:

Errors ($\alpha$, $\beta$)	(0.05; 0.2)
Effect
5.0%	894

For our experiment, about 900 objects in each experimental group are sufficient.
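
To see where a number of this order comes from, here is a minimal sketch of the textbook normal-approximation formula for the per-group sample size of a two-sample test of means, $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2\sigma^2/\delta^2$. It is an assumption that this is close to what the `'theory'` method computes; Ambrosia's exact procedure may differ, and the function name and the example mean and standard deviation below are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean, std, effect, alpha=0.05, beta=0.2):
    """Approximate per-group size for a two-sided two-sample test of means.

    `effect` is a multiplicative uplift (1.05 means +5%), `alpha` is the
    type I error rate and `beta` is the type II error rate.
    """
    delta = mean * (effect - 1.0)                  # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power 1 - beta
    n = 2 * ((z_alpha + z_beta) * std / delta) ** 2
    return math.ceil(n)


# Hypothetical metric with mean 100 and standard deviation 50.
print(sample_size_per_group(mean=100.0, std=50.0, effect=1.05))
```

The required size grows with the variance of the metric and shrinks quadratically as the detectable effect gets larger, which is why small effects on noisy metrics need large groups.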

Split groups

In our business scenario, we don’t need a real-time splitting system, so a batch group split is enough. We will use the same data frame as the complete database containing unique object IDs and some useful data.

Let’s split groups of the calculated size, stratifying by the gender and platform variables. The hash split approach is used so that the split result is deterministic.
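
To illustrate why a hash-based split is deterministic, here is a minimal standalone sketch that assigns a group from a hash of the salt and the object ID. It is not Ambrosia's implementation: the `hash_group` function, the choice of SHA-256, and the key format are assumptions for the example, and the sketch ignores group sizes and stratification.

```python
import hashlib


def hash_group(object_id, salt, labels=("A", "B")):
    """Map an object ID to a group label using a salted hash."""
    key = f"{salt}:{object_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The same (salt, object_id) pair always yields the same label.
    return labels[int(digest, 16) % len(labels)]


assignment = {object_id: hash_group(object_id, salt="exp_322")
              for object_id in range(10)}
print(assignment)
```

Because the label depends only on the salt and the ID, rerunning the split reproduces the same groups, while changing the salt gives a fresh, unrelated assignment for the next experiment.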

[9]:

splitter = Splitter(dataframe=df,
                    strat_columns=['gender', 'platform'],
                    fit_columns=['sessions'])

[10]:

splitted_groups = splitter.run(groups_size=900, method='hash', salt='exp_322')

[11]:

splitted_groups

[11]:

	id	watched	sessions	gender	platform	group
1	1	538.076739	6	Female	android	A
6	6	516.444015	10	Female	android	A
10	10	678.150205	3	Female	android	A
31	31	638.889779	11	Female	android	A
49	49	441.192430	5	Female	android	A
...	...	...	...	...	...	...
378	378	1217.191864	5	Male	android	A
258	258	1356.446101	3	Female	ios	A
1973	1973	662.959150	8	Female	android	B
4324	4324	610.512075	5	Male	web	B
4791	4791	607.091209	22	Male	web	B

1800 rows × 6 columns

Objects with these identifiers will fall into the corresponding groups. Let’s wait for the end of the experiment and look at the result.

Result measurement

The experiment has ended, and we have received a week of daily metrics for both groups.

Let’s aggregate the data to weekly values and check for statistically significant changes.

[12]:

experiment_result = pd.read_csv('../tests/test_data/watch_result.csv')
experiment_result.head()

[12]:

	id	watched	group	day
0	1708	349.581133	A	1
1	24	124.224169	A	1
2	1692	14.812922	A	1
3	185	179.607284	A	1
4	205	349.539016	A	1

Aggregate

[13]:

transformer = AggregatePreprocessor(real_method='sum')

[14]:

df_to_test = transformer.fit_transform(dataframe=experiment_result,
                                       groupby_columns='id',
                                       real_cols='watched',
                                       categorial_cols='group')

[15]:

df_to_test

[15]:

	id	watched	group
0	6	597.833362	A
1	11	549.314234	A
2	20	564.401942	A
3	21	248.735358	A
4	23	926.048946	B
...	...	...	...
1795	4987	454.662125	A
1796	4988	404.600192	B
1797	4997	594.629770	B
1798	4998	1025.918249	B
1799	4999	737.005009	B

1800 rows × 3 columns

Evaluate the result and calculate the relative effect with the corresponding confidence interval (CI)

[16]:

tester = Tester(dataframe=df_to_test, metrics='watched', column_groups='group')

[17]:

tester.run(effect_type='relative', method='theory')

[17]:

	first_type_error	pvalue	effect	confidence_interval	metric name	group A label	group B label
0	0.05	0.00004	0.079901	(0.0419, 0.1183)	watched	A	B

For the chosen type I error rate, the result is statistically significant, with a point estimate of the effect of about 8%.
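
For readers who want to see what such a result is made of, here is a rough standalone sketch of a two-sample comparison of means with a relative effect and its confidence interval. It uses a normal approximation for the p-value and the delta method for the interval, which is an assumption for illustration; `Tester` may compute these quantities differently. The function name and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def relative_effect_test(group_a, group_b, alpha=0.05):
    """Return (pvalue, relative effect, confidence interval) for B versus A."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    var_a, var_b = variance(group_a), variance(group_b)

    # Two-sided test of equal means (normal approximation, large samples).
    se_diff = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_b - mean_a) / se_diff
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))

    # Relative effect and its delta-method standard error.
    effect = mean_b / mean_a - 1
    se_ratio = math.sqrt(var_b / (n_b * mean_a ** 2)
                         + mean_b ** 2 * var_a / (n_a * mean_a ** 4))
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return pvalue, effect, (effect - q * se_ratio, effect + q * se_ratio)


# Toy data: group B is shifted upwards by 8 units relative to group A.
group_a = [100 + (i % 11) for i in range(500)]
group_b = [108 + (i % 11) for i in range(500)]
pvalue, effect, ci = relative_effect_test(group_a, group_b)
print(pvalue, effect, ci)
```

A result is called significant when the p-value is below the chosen type I error rate, which for a relative effect corresponds to a confidence interval that does not contain zero.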