Example of the `Designer` class usage for A/B test parameters calculation¶

This tutorial reviews Ambrosia’s experiment design tools using an example of calculating the parameters of a hypothetical A/B test. Synthetic data on MTS KION user metrics will be used for this.

Before we start looking at the tools, here is a short list of questions and answers to help understand some of the experiment design essentials.

Note: In this tutorial fixed-horizon experiments are assumed. For this kind of experiment, decisions are made based on the results obtained at the end of the planned test duration.

What is needed before designing A/B test parameters?¶

Before the experiment it is good to have:

A formulated and fixed hypothesis
One or a set of metrics that meet all the requirements of the task, and on which the conclusion will be drawn
A fixed plan of the decision-making process based on the results of the experiment

Also, to calculate the test parameters themselves, we need historical data on the selected metrics.

What parameters does a typical A/B test have?¶

In a typical A/B test we use a statistical criterion that tests our hypothesis, and the experimental setup has four related parameters:

Type I error (alpha) - the probability that the criterion falsely reports a change when there is none; alpha is called the significance level and 1 - alpha the confidence level
Type II error (beta) - the probability that the criterion misses a change that really exists; 1 - beta is called statistical power
Group sizes - the number of objects in each experimental group, converted to the duration of the experiment using traffic
Minimal detectable effect (MDE) - the smallest true effect of a change that can be detected with the chosen statistical power at the chosen significance level

Note: For tests with multiple groups or sets of metrics, the resulting multiple comparisons must be taken into account appropriately when calculating the parameters of the experiment.
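
One common and conservative way to account for several comparisons is the Bonferroni correction, which splits the type I error budget equally between them. The sketch below only illustrates the idea: the function name and the setup are hypothetical, and this tutorial does not state which correction, if any, Ambrosia applies.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison type I error that keeps the family-wise error at most alpha."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Hypothetical setup: 2 test groups, each compared with control on 3 metrics.
n_comparisons = 2 * 3
per_test_alpha = bonferroni_alpha(0.05, n_comparisons)
print(f"{per_test_alpha:.4f}")
```

The corrected value would then be used as the type I error when designing each individual comparison, which increases the required group sizes.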

Why does one need to calculate A/B test parameters?¶

This is necessary to obtain correct and expected results from the experiment.

Nobody wants to run an experiment longer than necessary or get results with low statistical power.

Typically, researchers fix the type I error at some level (the industry default is 0.05) and try to maximize the statistical power of the test under the existing limitations of the business environment.

These limitations usually include:

Test duration limitation due to risks of implemented change negative impact
Test duration limitation because of test costs
Group sizes limitation due to limits of available objects pool or traffic channel
MDE limitation due to its minimal reasonable size
MDE limitation due to weak impact of the change on the tested metric
MDE limitation due to development costs of implemented change
Costs of type I and type II errors, which limit or fix these values

For example, there may be such a statement of the design problem:

What is the minimal detectable effect on the metric that we can detect with given errors of type I and II, if the size of our groups is fixed?

How can the parameters be calculated?¶

Ambrosia offers two approaches to calculate experiment parameters using metric historical data.

Theoretical approach

The first method is based on an analytical formula for the difference of normally distributed quantities. It is very fast because it requires only the mean and variance of the empirical distribution of the metric, and it is recommended as the first thing to try.

Don’t worry if your metric isn’t normally distributed: for large enough groups the CLT will work for you. However, to be sure the results are correct, check that the corresponding confidence intervals reach their nominal coverage.

You can read more about this theoretical formula here or in other sources.
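
As an illustration, here is a minimal sketch of the textbook normal-approximation formulas for equal groups and a two-sided test. The function names and the metric moments are made up for this example, and this is not necessarily the exact implementation used inside Ambrosia.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def group_size(mean, std, effect, alpha=0.05, beta=0.2):
    """Per-group size for a two-sided test with equal groups.

    `effect` is a multiplier, as in this tutorial: 1.05 means a +5% change.
    """
    delta = mean * (effect - 1.0)  # absolute effect to detect
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(1 - beta)
    return ceil(2 * (z_sum * std / delta) ** 2)


def mde(mean, std, size, alpha=0.05, beta=0.2):
    """Relative MDE (0.05 means 5%) for a fixed per-group size."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(1 - beta)
    return z_sum * std * sqrt(2 / size) / mean


# Made-up metric moments, not taken from the KION data.
n = group_size(mean=100.0, std=50.0, effect=1.05)
print(n, round(mde(mean=100.0, std=50.0, size=n), 4))
```

The two functions are inverses of each other up to rounding: plugging the computed size back into `mde` returns roughly the 5% effect we started from.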

Empirical approach

The theoretical approach is fast and convenient, but does not take into account your specific criteria and all the features of the distribution of metric values.

The empirical method allows parameters to be calculated by repeatedly sampling the groups from the passed historical data, modeling the effect on the test group, and applying the selected statistical test to a large number of group pairs. Thus, the statistical power can be estimated empirically and other parameters optimally matched.

This method is more computationally expensive and can give noisy parameter estimates when the number of sampled groups is small.
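
The idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than Ambrosia's implementation: the "historical" data are synthetic, the helper names are hypothetical, and a large-sample z-test stands in for the selected criterion.

```python
import random
from math import sqrt
from statistics import NormalDist


def mean_and_var(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((x - mean) ** 2 for x in values) / (n - 1)


def empirical_power(history, size, effect, alpha=0.05, n_pairs=300, seed=0):
    """Share of pseudo A/B tests in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_pairs):
        # Sample both groups with replacement from the historical values.
        group_a = rng.choices(history, k=size)
        # Model the effect by scaling every value in the test group.
        group_b = [x * effect for x in rng.choices(history, k=size)]
        mean_a, var_a = mean_and_var(group_a)
        mean_b, var_b = mean_and_var(group_b)
        z_stat = (mean_b - mean_a) / sqrt(var_a / size + var_b / size)
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_pairs


# Synthetic skewed metric with mean near 100, standing in for real history.
data_rng = random.Random(42)
history = [data_rng.expovariate(1 / 100) for _ in range(5000)]

power_no_effect = empirical_power(history, size=500, effect=1.0)
power_with_effect = empirical_power(history, size=500, effect=1.2)
print(power_no_effect, power_with_effect)
```

With no effect the rejection rate should stay near alpha, which is a useful sanity check of the criterion; with a real effect it estimates the power for that group size.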

Note: The empirical approach is not suitable for binary metrics. Instead, you can choose the binary method, which solves the inverse problem by constructing a large number of binary confidence intervals.

This method has its own features; see the separate example on designing experiments with binary metrics.

Now, let’s start the tutorial¶

[2]:

import numpy as np
import pandas as pd

import yaml

from ambrosia.designer import Designer, design, load_from_config

Load data

[3]:

data = pd.read_csv('../tests/test_data/kion_data.csv', sep=';')

[4]:

data.head()

[4]:

	profile_id	sum_dur	vod_cnt	ln_vod_cnt	bin_col
0	99402893794	20104282	83	5.533356	1
1	878511937265	3986136	53	4.807294	1
2	998929369788	2063965	22	3.187069	1
3	265028786131	523539	14	2.679252	1
4	995182338752	1588224	19	4.177776	1

The Designer class is Ambrosia’s main tool for calculating experimental parameters. It has one main public method run() which returns the table with calculated parameters of the test.

Let’s create an instance of the class and pass the constructor a dataframe with historical data on the metric we will design for. In our case this is sum_dur, the total content viewing duration per user.

[5]:

designer = Designer(dataframe=data, metrics='sum_dur')

In fact, we can pass this dataframe and the metrics later as arguments to the run() method. We can do the same with most of the parameters related directly to the experiment (errors, effects, and so on): either pass them to the constructor during initialization (they then become attributes of the created instance), or pass them later, when calling the run() method. If a parameter is set in both places, the argument passed to the method takes precedence over the attribute value.
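
The precedence rule can be pictured with a tiny stand-in class. `TinyDesigner` is hypothetical and mimics only the argument handling described above; it is not Ambrosia code and performs no design calculations.

```python
class TinyDesigner:
    """Hypothetical stand-in that mimics only the argument precedence rule."""

    def __init__(self, sizes=None, effects=None):
        self.sizes = sizes
        self.effects = effects

    def run(self, sizes=None, effects=None):
        # A value passed to run() wins; otherwise fall back to the attribute.
        chosen_sizes = sizes if sizes is not None else self.sizes
        chosen_effects = effects if effects is not None else self.effects
        if chosen_sizes is None or chosen_effects is None:
            raise ValueError("sizes and effects must be set somewhere")
        return {"sizes": chosen_sizes, "effects": chosen_effects}


tiny = TinyDesigner(sizes=[1000], effects=[1.05])
result = tiny.run(effects=[1.2])  # overrides effects for this call only
print(result)
```

In this sketch the attribute set in the constructor stays untouched, so a later call without arguments falls back to the original values.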

Theoretical design¶

Now we will calculate the parameters of the experiment using the theoretical approach and a grid of the other known parameters.

[6]:

### Set parameters grid
effects = [1.05, 1.1, 1.2]  # Relative effects as multipliers: 1.05 means +5%
sizes = [1000, 3000, 7000]  # Size of each group
first_type_errors = [0.01, 0.05]
second_type_errors = [0.1, 0.2]

Calculate MDE

[7]:

designer.run(to_design='effect',
             method='theory',
             first_type_errors=first_type_errors,
             second_type_errors=second_type_errors,
             sizes=sizes)

[7]:

Errors ($\alpha$, $\beta$)	(0.01; 0.1)	(0.01; 0.2)	(0.05; 0.1)	(0.05; 0.2)
Group sizes
1000	61.1%	54.2%	51.4%	44.4%
3000	35.3%	31.3%	29.6%	25.6%
7000	23.1%	20.5%	19.4%	16.8%
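
A quick sanity check of this table: under the normal approximation the MDE shrinks in proportion to 1 / sqrt(group size). The sketch below takes the reported values for one error pair and predicts the other rows from the first one; it assumes only that scaling law, not any Ambrosia internals.

```python
from math import sqrt

# MDE values for (alpha, beta) = (0.01, 0.1), copied from the table above.
mde_by_size = {1000: 0.611, 3000: 0.353, 7000: 0.231}

base_size = 1000
# Under the normal approximation MDE shrinks as 1 / sqrt(group size).
predicted = {
    size: mde_by_size[base_size] * sqrt(base_size / size)
    for size in mde_by_size
}

for size, reported in mde_by_size.items():
    print(size, f"reported={reported:.1%}", f"predicted={predicted[size]:.1%}")
```

The practical consequence is that halving the MDE requires roughly four times as many objects per group.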

We will keep using the first_type_errors and second_type_errors grids defined earlier, so let’s set them with the setters.

[8]:

designer.set_first_errors(first_type_errors)
designer.set_second_errors(second_type_errors)

Now calculate group sizes

[9]:

designer.run(to_design='size', method='theory', effects=effects)

[9]:

Errors ($\alpha$, $\beta$)	(0.01; 0.1)	(0.01; 0.2)	(0.05; 0.1)	(0.05; 0.2)
Effect
5.0%	149323	117206	105448	78768
10.0%	37332	29303	26363	19693
20.0%	9335	7327	6592	4924

Finally calculate statistical power

[10]:

designer.run(to_design='power', method='theory', effects=effects, sizes=sizes)

[10]:

	Group sizes	1000	3000	7000
$\alpha$	Effect
0.01	5.0%	1.4%	2.2%	4.1%
	10.0%	2.7%	6.9%	18.3%
	20.0%	9.4%	34.8%	77.8%
0.05	5.0%	6.1%	8.5%	13.3%
	10.0%	9.7%	19.4%	38.6%
	20.0%	24.3%	59.0%	91.6%

We can change the alternative hypothesis. By default it is "two-sided"; now we want to test only for positive changes.

[11]:

designer.run(to_design='power',
             method='theory',
             effects=effects,
             sizes=sizes,
             alternative='greater')

[11]:

	Group sizes	1000	3000	7000
$\alpha$	Effect
0.01	5.0%	2.2%	3.8%	6.8%
	10.0%	4.5%	10.9%	25.6%
	20.0%	14.4%	44.4%	84.5%
0.05	5.0%	9.2%	13.6%	20.9%
	10.0%	15.5%	29.1%	51.0%
	20.0%	35.1%	70.6%	95.5%
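
The gain from a one-sided alternative can be roughly reproduced by hand. The sketch below converts a two-sided power value into a one-sided one under the normal approximation; the function name is hypothetical, and the conversion ignores the far rejection tail of the two-sided test, so it is reasonable only when the two-sided power is not small.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def one_sided_from_two_sided(power_two_sided, alpha):
    """Convert two-sided power to one-sided power at the same alpha."""
    # Recover the standardized effect implied by the two-sided power.
    standardized_effect = Z.inv_cdf(power_two_sided) + Z.inv_cdf(1 - alpha / 2)
    # The one-sided test uses the smaller critical value z_{1 - alpha}.
    return Z.cdf(standardized_effect - Z.inv_cdf(1 - alpha))


# Two-sided power for alpha=0.05, effect 20%, size 7000 from the earlier table.
one_sided = one_sided_from_two_sided(0.916, alpha=0.05)
print(f"{one_sided:.1%}")
```

The result lands close to the 95.5% reported for the same cell in the one-sided table above.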

The groups_ratio parameter makes the group sizes unequal: the size of group B equals the size of group A multiplied by groups_ratio. By default it is 1.0.

Let’s calculate the required sizes for group A : group B in a 10 : 1 proportion. The resulting table shows the size of group A.

[12]:

designer.run(to_design='size',
             method='theory',
             effects=effects,
             sizes=sizes,
             groups_ratio=0.1)

[12]:

Errors ($\alpha$, $\beta$)	(0.01; 0.1)	(0.01; 0.2)	(0.05; 0.1)	(0.05; 0.2)
Effect
5.0%	821269	644622	579958	433219
10.0%	205320	161158	144991	108306
20.0%	51333	40292	36249	27078
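
These sizes can be related to the equal-groups case by a simple variance argument: the variance of the difference of means is proportional to 1/n_A + 1/n_B. A minimal sketch of that reasoning follows; the helper name is hypothetical and equal variances in both groups are assumed.

```python
def size_a_for_ratio(equal_group_size, groups_ratio):
    """Size of group A when group B has groups_ratio times as many objects."""
    # Var(mean_B - mean_A) is proportional to 1/n_A + 1/(groups_ratio * n_A),
    # versus 2/n for equal groups, so n_A is inflated by (1 + 1/ratio) / 2.
    inflation = (1 + 1 / groups_ratio) / 2
    return equal_group_size * inflation


# Equal-groups sizes for the 5% effect row, copied from the earlier size table.
equal_sizes = [149323, 117206, 105448, 78768]
unequal_sizes = [round(size_a_for_ratio(n, groups_ratio=0.1)) for n in equal_sizes]
print(unequal_sizes)
```

For groups_ratio=0.1 the inflation factor is 5.5, and the products agree with the first row of the table above to within about a dozen objects, a difference attributable to rounding.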

Empirical design¶

Now we will change design method to empiric and calculate group sizes by conducting a lot of pseudo A/B tests on historical data.

As a default statistical criterion, the Designer uses the two-sample independent T-test.

To limit the computational cost we will set the bs_samples parameter to a low value. This parameter determines how many pseudo A/B tests are conducted to evaluate one value of the parameter; higher values (at least 1000) give more accurate estimates.

We will also use multiprocessing to speed up calculations and set the value of n_jobs to 4 (by default it is equal to 1).

[13]:

designer.run(to_design='size',
             method='empiric',
             effects=effects,
             bs_samples=100,
             n_jobs=4)

[13]:

Errors ($\alpha$, $\beta$)	(0.01, 0.1)	(0.01, 0.2)	(0.05, 0.1)	(0.05, 0.2)
Effect
5.0%	153569	137126	117300	73706
10.0%	41096	34920	27711	21503
20.0%	10299	8827	7639	5822
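
To get a feeling for how noisy such estimates are, we can treat each pseudo A/B test as an independent success or failure. Under that assumption, which is a simplification made for this sketch, the standard error of an estimated power follows the binomial formula.

```python
from math import sqrt


def power_standard_error(power, bs_samples):
    """Binomial standard error of a power estimate from bs_samples pseudo tests."""
    return sqrt(power * (1 - power) / bs_samples)


for bs_samples in (100, 1000, 10000):
    se = power_standard_error(0.8, bs_samples)
    print(bs_samples, f"+/- {se:.3f}")
```

With bs_samples=100 a true power of 80% is estimated with a standard error of about 4 percentage points, which is why the sizes in the table above should be read as rough values.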

The statistical criterion can be changed with the criterion parameter.

[14]:

designer.run(to_design='size',
             method='empiric',
             effects=effects,
             criterion='mw',
             bs_samples=100,
             n_jobs=4)

[14]:

Errors ($\alpha$, $\beta$)	(0.01, 0.1)	(0.01, 0.2)	(0.05, 0.1)	(0.05, 0.2)
Effect
5.0%	66810	58589	46579	39088
10.0%	15249	12426	10748	10069
20.0%	4247	3891	2979	2340

We can also use the bootstrap criterion, here to estimate the power.

[25]:

designer.run(to_design='power',
             method='empiric',
             effects=effects,
             sizes=sizes,
             criterion='bootstrap',
             bs_samples=1000,
             n_jobs=4)

[25]:

	Group sizes	(1000, 1000)	(3000, 3000)	(7000, 7000)
$\alpha$	Effect
0.01	5.0%	2.2%	3.9%	5.1%
	10.0%	4.5%	7.8%	19.7%
	20.0%	9.9%	31.5%	70.8%
0.05	5.0%	7.6%	9.8%	13.9%
	10.0%	10.6%	19.7%	37.5%
	20.0%	24.4%	57.1%	84.7%

There are a number of criteria implemented in Ambrosia, but remember that each of them has its own prerequisites and tests its own null hypothesis.

The alternative and groups_ratio parameters are also available in the empirical approach.

[27]:

designer.run(to_design='power',
             method='empiric',
             sizes=sizes,
             effects=effects,
             criterion='ttest',
             bs_samples=10000,
             alternative='greater',
             groups_ratio=2.0,
             n_jobs=4)

[27]:

	Group sizes	(1000, 2000)	(3000, 6000)	(7000, 14000)
$\alpha$	Effect
0.01	5.0%	3.6%	5.9%	10.7%
	10.0%	7.9%	16.4%	34.5%
	20.0%	21.9%	55.4%	88.0%
0.05	5.0%	12.5%	17.6%	25.9%
	10.0%	20.5%	37.1%	58.8%
	20.0%	44.4%	76.5%	96.4%

Note: The empirical approach consumes a significant amount of computing resources and memory, especially when calculations are made on large groups.

Stand-alone design function¶

There is a function that replicates the behavior of the Designer and can be used in the same way to calculate A/B test parameters.

Let’s design test parameters for two metrics. The output is a dict of pandas tables, one per metric.

[28]:

design_result = design(to_design='power',
                       dataframe=data,
                       metrics=['sum_dur', 'vod_cnt'],
                       method='theory',
                       first_type_errors=first_type_errors,
                       sizes=sizes,
                       effects=effects)

Theoretical design of power for sum_dur metric

[29]:

design_result['sum_dur']

[29]:

	Group sizes	1000	3000	7000
$\alpha$	Effect
0.01	5.0%	1.4%	2.2%	4.1%
	10.0%	2.7%	6.9%	18.3%
	20.0%	9.4%	34.8%	77.8%
0.05	5.0%	6.1%	8.5%	13.3%
	10.0%	9.7%	19.4%	38.6%
	20.0%	24.3%	59.0%	91.6%

Theoretical design of power for vod_cnt metric

[30]:

design_result['vod_cnt']

[30]:

	Group sizes	1000	3000	7000
$\alpha$	Effect
0.01	5.0%	2.3%	5.6%	14.4%
	10.0%	7.6%	27.5%	67.2%
	20.0%	38.5%	91.6%	100.0%
0.05	5.0%	8.8%	16.7%	32.7%
	10.0%	20.8%	50.7%	85.6%
	20.0%	62.7%	97.7%	100.0%

Storable configuration¶

A Designer class instance can be saved to and created from a YAML config file. Attributes such as datasets are not serialized and must be set after the instance is loaded.

Let’s create an instance with the preferred attributes.

[31]:

store_path = '_examples_configs/designer_config.yaml'

[32]:

storable_designer = Designer(effects=[1.05, 1.1, 1.2],
                             sizes=[1000, 3000, 7000],
                             first_type_errors=[0.01, 0.05],
                             metrics=['sum_dur', 'ln_vod_cnt'])

[33]:

storable_designer.__getstate__()

[33]:

{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}

Save the config in a file

[34]:

with open(store_path, 'w') as outfile:
    yaml.dump(storable_designer, outfile, default_flow_style=True)

Load instance from a file and set data

[35]:

loaded_designer = load_from_config(store_path)
loaded_designer.set_dataframe(data)

[36]:

loaded_designer.__getstate__()

[36]:

{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}

Design an experiment parameter, here the power

[37]:

design_results = loaded_designer.run('power')

[38]:

design_results['sum_dur']

[38]:

	Group sizes	1000	3000	7000
$\alpha$	Effect
0.01	5.0%	1.4%	2.2%	4.1%
	10.0%	2.7%	6.9%	18.3%
	20.0%	9.4%	34.8%	77.8%
0.05	5.0%	6.1%	8.5%	13.3%
	10.0%	9.7%	19.4%	38.6%
	20.0%	24.3%	59.0%	91.6%

Learn more¶

There are a few more examples of designing experiment parameters with Ambrosia

Check:

Designer class documentation
An example of binary metrics experiment design
An example of designing parameters using Spark DataFrame (currently has limited functionality)
Habr post about Ambrosia
