Example of the Designer class usage for A/B test parameters calculation¶
This tutorial reviews Ambrosia’s experiment design tools using the example of calculating the parameters of a hypothetical A/B test. Synthetic data on MTS KION user metrics is used for this purpose.
Before we start looking at the tools, here is a short list of questions and answers that covers some of the essentials of experiment design.
Note: This tutorial assumes fixed-horizon experiments. In this kind of experiment, decisions are made based on the results obtained at the end of the planned test duration.
What is needed before designing A/B test parameters?¶
Before the experiment it is good to have:
A formulated and fixed hypothesis
One metric or a set of metrics that meets all the requirements of the task and on which the conclusion will be drawn
A fixed plan for the decision-making process based on the results of the experiment
In addition, calculating the test parameters themselves requires historical data on the selected metrics.
What parameters does a typical A/B test have?¶
In a typical A/B test we use a statistical criterion that tests our hypothesis, and the experimental setup has four related parameters:
Type I error (alpha) - the probability that the criterion falsely reports success when there are no real changes; alpha is called the significance level and 1 - alpha the confidence level
Type II error (beta) - the probability that the criterion falsely reports failure when real changes are present; 1 - beta is called statistical power
Group sizes - the number of objects in each experimental group, converted to the duration of the experiment using traffic
Minimal detectable effect (MDE) - the smallest true effect of a change that can be detected with a given level of statistical power at a given significance level
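The conversion of group sizes into a test duration mentioned in the list above can be sketched as follows. The function name and the assumption of a constant daily inflow of new eligible users are illustrative and are not part of Ambrosia.

```python
import math


def duration_in_days(group_size: int, n_groups: int, daily_traffic: int) -> int:
    """Days needed to fill all groups, assuming a constant daily inflow of new users."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    # Total number of objects needed, divided by the daily inflow, rounded up.
    return math.ceil(group_size * n_groups / daily_traffic)


print(duration_in_days(group_size=7000, n_groups=2, daily_traffic=1500))
```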
Note: For tests with multiple groups or sets of metrics, the multiplicity of comparisons must be taken into account in an appropriate way when calculating the parameters of the experiment.
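One common way to account for several comparisons is the Bonferroni correction, which divides the type I error level by the number of comparisons. The sketch below is only an illustration of that idea under the assumption that every test group is compared with the control on every metric; the helper name is hypothetical and this is not necessarily what Ambrosia does internally.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison type I error level that keeps the family-wise level at alpha."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


n_metrics = 2       # assumed number of metrics in the decision
n_test_groups = 3   # assumed number of test groups, each compared with control
adjusted_alpha = bonferroni_alpha(0.05, n_metrics * n_test_groups)
print(round(adjusted_alpha, 5))
```

The adjusted value can then be used as the type I error level when designing each individual comparison.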
Why does one need to calculate A/B test parameters?¶
Typically, researchers fix the type I error at some level (the industry default is 0.05) and try to maximize the statistical power of the test under the existing limitations of the business environment.
These limitations usually include:
Test duration limitation due to the risk that the implemented change has a negative impact
Test duration limitation due to test costs
Group size limitation due to limits of the available object pool or traffic channel
MDE limitation due to its minimal reasonable size
MDE limitation due to the weak impact of the change on the tested metric
MDE limitation due to the development costs of the implemented change
Costs of type I and type II errors, which limit or fix these values
How can parameters be calculated?¶
Ambrosia offers two approaches to calculate experiment parameters using metric historical data.
Theoretical approach
The first method is based on the analytical formula for the difference of normally distributed quantities. This method is very fast because it only requires the mean and variance of the empirical distribution of the metric, and it is recommended as a first step.
Don’t worry if your metric isn’t normally distributed: for a large enough group the CLT will work for you. However, to be sure the results are correct, it is necessary to check that the corresponding confidence intervals achieve their nominal coverage.
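A coverage check of this kind can be sketched with a small simulation. The example below is an assumption-laden illustration: it draws samples from a skewed exponential distribution with a known mean, builds normal-approximation confidence intervals for the mean, and counts how often they contain the true value. None of the names come from Ambrosia.

```python
import random
from statistics import NormalDist


def ci_coverage(sample_size: int, n_trials: int = 1000,
                alpha: float = 0.05, seed: int = 0) -> float:
    """Share of normal-approximation CIs for the mean that cover the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    true_mean = 1.0  # mean of the exponential distribution with rate 1
    covered = 0
    for _ in range(n_trials):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        variance = sum((x - sample_mean) ** 2 for x in sample) / (sample_size - 1)
        half_width = z * (variance / sample_size) ** 0.5
        covered += sample_mean - half_width <= true_mean <= sample_mean + half_width
    return covered / n_trials


# Coverage is expected to approach the nominal 95% as the sample grows.
for n in (10, 200):
    print(n, ci_coverage(n))
```

If the observed coverage is noticeably below the nominal level for realistic group sizes, the theoretical results should be treated with caution.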
You can read more about this theoretical formula here or in other sources.
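As a rough illustration of the kind of formula involved, the sketch below computes the per-group size for a two-sided comparison of means with equal group sizes and equal variances, using only the mean and standard deviation of the metric. It is the textbook normal-approximation formula, written with hypothetical names, and is not claimed to reproduce Ambrosia's implementation exactly.

```python
import math
from statistics import NormalDist


def theoretical_group_size(mean: float, std: float, effect: float,
                           alpha: float = 0.05, beta: float = 0.2) -> int:
    """Per-group size needed to detect a relative effect (e.g. 1.1 means +10%)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    delta = mean * (effect - 1.0)  # absolute effect in metric units
    # n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
    return math.ceil(2 * (z_sum * std / delta) ** 2)


print(theoretical_group_size(mean=100.0, std=50.0, effect=1.1))
print(theoretical_group_size(mean=100.0, std=50.0, effect=1.05))
```

Halving the relative effect roughly quadruples the required group size, because the size is inversely proportional to the squared effect.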
Empirical approach
The theoretical approach is fast and convenient, but it does not take into account your specific criterion or all the features of the distribution of metric values.
The empirical method calculates parameters by repeatedly sampling groups from the passed historical data, modeling the effect on the test group, and applying the selected statistical test to a large number of group pairs. In this way the statistical power can be estimated empirically and the other parameters matched to it.
This method is more computationally expensive and can give noisy parameter estimates when the number of sampled groups is small.
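The procedure described above can be sketched in a few lines. This is a simplified illustration, not Ambrosia's code: groups are sampled with replacement from synthetic "historical" data, the effect is modeled as a multiplier on the test group, and a large-sample z-test on the difference of means stands in for the statistical criterion.

```python
import random
from statistics import NormalDist


def empirical_power(history, group_size, effect, alpha=0.05, n_iter=300, seed=0):
    """Share of pseudo A/B tests in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_iter):
        group_a = rng.choices(history, k=group_size)
        # Model the effect by scaling the metric in the test group.
        group_b = [x * effect for x in rng.choices(history, k=group_size)]
        mean_a = sum(group_a) / group_size
        mean_b = sum(group_b) / group_size
        var_a = sum((x - mean_a) ** 2 for x in group_a) / (group_size - 1)
        var_b = sum((x - mean_b) ** 2 for x in group_b) / (group_size - 1)
        z_stat = (mean_b - mean_a) / ((var_a + var_b) / group_size) ** 0.5
        rejections += abs(z_stat) > z_crit
    return rejections / n_iter


data_rng = random.Random(42)
history = [data_rng.lognormvariate(0, 1) for _ in range(5000)]  # synthetic skewed metric
print(empirical_power(history, group_size=500, effect=1.2))
```

With `effect=1.0` the same function estimates the actual type I error rate, which is a useful sanity check of the chosen criterion.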
There is also a binary method, which solves the inverse problem by constructing a large number of binary confidence intervals.
Now, let’s start the tutorial¶
[2]:
import numpy as np
import pandas as pd
import yaml
from ambrosia.designer import Designer, design, load_from_config
Load data
[3]:
data = pd.read_csv('../tests/test_data/kion_data.csv', sep=';')
[4]:
data.head()
[4]:
| | profile_id | sum_dur | vod_cnt | ln_vod_cnt | bin_col |
|---|---|---|---|---|---|
| 0 | 99402893794 | 20104282 | 83 | 5.533356 | 1 |
| 1 | 878511937265 | 3986136 | 53 | 4.807294 | 1 |
| 2 | 998929369788 | 2063965 | 22 | 3.187069 | 1 |
| 3 | 265028786131 | 523539 | 14 | 2.679252 | 1 |
| 4 | 995182338752 | 1588224 | 19 | 4.177776 | 1 |
The Designer class is Ambrosia’s main tool for calculating experiment parameters. It has one main public method, run(), which returns a table with the calculated parameters of the test.
Let’s create an instance of the class and pass to the constructor a dataframe with historical data on the metric that we will design for. In our case, this is sum_dur, the total duration of content viewing per user.
[5]:
designer = Designer(dataframe=data, metrics='sum_dur')
In fact, we can pass this dataframe and the metrics later as arguments to the run() method. The same applies to most of the parameters related directly to the experiment (errors, effects, and so on): they can either be passed to the constructor during initialization, in which case they become attributes of the created instance, or passed later when the run() method is executed. If a parameter is set in both places, the argument passed to the method takes precedence over the attribute value.
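The precedence rule can be illustrated with a toy class. `TinyDesigner` below is a hypothetical stand-in written only for this illustration; it is not the Ambrosia `Designer` and does not reproduce its internals.

```python
class TinyDesigner:
    """Hypothetical stand-in that only demonstrates argument-over-attribute precedence."""

    def __init__(self, sizes=None):
        self.sizes = sizes  # value stored at construction time

    def run(self, sizes=None):
        # The argument, when given, overrides the stored attribute.
        chosen = sizes if sizes is not None else self.sizes
        if chosen is None:
            raise ValueError("sizes must be passed to the constructor or to run()")
        return chosen


toy = TinyDesigner(sizes=[1000])
print(toy.run())              # falls back to the attribute
print(toy.run(sizes=[3000]))  # the argument takes precedence
```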
Theoretical design¶
Now we will calculate the parameters of the experiment using the theoretical approach and a grid of the other known parameters.
[6]:
# Set the parameter grid
effects = [1.05, 1.1, 1.2]  # relative effects as multipliers: +5%, +10%, +20%
sizes = [1000, 3000, 7000]  # size of each group
first_type_errors = [0.01, 0.05]  # alpha values
second_type_errors = [0.1, 0.2]  # beta values
Calculate MDE
[7]:
designer.run(to_design='effect',
             method='theory',
             first_type_errors=first_type_errors,
             second_type_errors=second_type_errors,
             sizes=sizes)
[7]:
| Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| Group sizes | ||||
| 1000 | 61.1% | 54.2% | 51.4% | 44.4% |
| 3000 | 35.3% | 31.3% | 29.6% | 25.6% |
| 7000 | 23.1% | 20.5% | 19.4% | 16.8% |
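For intuition about how such MDE values behave, here is a sketch of the textbook normal-approximation formula for a relative MDE with equal group sizes and equal variances. The function name and the example inputs are assumptions for illustration, and the formula is not claimed to be Ambrosia's exact implementation.

```python
import math
from statistics import NormalDist


def theoretical_mde(mean: float, std: float, group_size: int,
                    alpha: float = 0.05, beta: float = 0.2) -> float:
    """Relative MDE (0.1 means 10%) for a two-sided test with equal groups."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    absolute_mde = z_sum * std * math.sqrt(2 / group_size)
    return absolute_mde / mean


for n in (1000, 3000, 7000):
    print(n, f"{theoretical_mde(mean=100.0, std=50.0, group_size=n):.1%}")
```

In this formula the MDE shrinks in proportion to 1 / sqrt(n). The table above follows the same pattern: for errors (0.05; 0.2), 44.4% at 1000 objects divided by sqrt(3) gives about 25.6%, the value reported for 3000 objects.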
We will use these error rates in the following steps, so let’s store them on the instance using setters.
[8]:
designer.set_first_errors(first_type_errors)
designer.set_second_errors(second_type_errors)
Now calculate the group sizes
[9]:
designer.run(to_design='size', method='theory', effects=effects)
[9]:
| Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| Effect | ||||
| 5.0% | 149323 | 117206 | 105448 | 78768 |
| 10.0% | 37332 | 29303 | 26363 | 19693 |
| 20.0% | 9335 | 7327 | 6592 | 4924 |
Finally, calculate the statistical power
[10]:
designer.run(to_design='power', method='theory', effects=effects, sizes=sizes)
[10]:
| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |
We can change the alternative hypothesis, which is "two-sided" by default. Here we want to test for positive changes only.
[11]:
designer.run(to_design='power',
             method='theory',
             effects=effects,
             sizes=sizes,
             alternative='greater')
[11]:
| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.2% | 3.8% | 6.8% |
| 0.01 | 10.0% | 4.5% | 10.9% | 25.6% |
| 0.01 | 20.0% | 14.4% | 44.4% | 84.5% |
| 0.05 | 5.0% | 9.2% | 13.6% | 20.9% |
| 0.05 | 10.0% | 15.5% | 29.1% | 51.0% |
| 0.05 | 20.0% | 35.1% | 70.6% | 95.5% |
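The difference between the two alternatives can be seen in a sketch of the normal-approximation power formula. The function below is an illustration with hypothetical names and example inputs, assuming equal group sizes and equal variances; it is not Ambrosia's implementation.

```python
from statistics import NormalDist


def theoretical_power(mean: float, std: float, effect: float, group_size: int,
                      alpha: float = 0.05, alternative: str = "two-sided") -> float:
    """Approximate power of a test for a difference in means."""
    z = NormalDist()
    # Expected value of the z-statistic under the alternative hypothesis.
    shift = mean * (effect - 1.0) / (std * (2 / group_size) ** 0.5)
    if alternative == "greater":
        return 1 - z.cdf(z.inv_cdf(1 - alpha) - shift)
    if alternative == "two-sided":
        crit = z.inv_cdf(1 - alpha / 2)
        return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)
    raise ValueError("alternative must be 'two-sided' or 'greater'")


for alt in ("two-sided", "greater"):
    print(alt, f"{theoretical_power(100.0, 50.0, 1.1, 393, alternative=alt):.1%}")
```

A one-sided test uses a smaller critical value (the 1 - alpha quantile instead of the 1 - alpha/2 quantile), so for an effect in the tested direction its power is higher. The two power tables above are consistent with this: at alpha = 0.05, a 20% effect, and 7000 objects per group, the power rises from 91.6% to 95.5%.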
The groups_ratio parameter makes it possible to use unequal group sizes. The size of group B is equal to the size of group A multiplied by the groups_ratio value. By default, it is equal to 1.0.
Let’s calculate the required size for a group A to group B proportion of 10 : 1. The resulting table reports the size of group A.
[12]:
designer.run(to_design='size',
             method='theory',
             effects=effects,
             sizes=sizes,
             groups_ratio=0.1)
[12]:
| Errors ($\alpha$, $\beta$) | (0.01; 0.1) | (0.01; 0.2) | (0.05; 0.1) | (0.05; 0.2) |
|---|---|---|---|---|
| Effect | ||||
| 5.0% | 821269 | 644622 | 579958 | 433219 |
| 10.0% | 205320 | 161158 | 144991 | 108306 |
| 20.0% | 51333 | 40292 | 36249 | 27078 |
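The effect of unequal allocation can be reproduced with the textbook normal-approximation formula, in which the factor 2 for equal groups is replaced by (1 + 1 / groups_ratio). The sketch below uses hypothetical names and example inputs and assumes equal variances in both groups; it is not Ambrosia's implementation.

```python
import math
from statistics import NormalDist


def group_a_size(mean: float, std: float, effect: float, groups_ratio: float = 1.0,
                 alpha: float = 0.05, beta: float = 0.2) -> int:
    """Size of group A when group B contains groups_ratio * size_A objects."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    delta = mean * (effect - 1.0)
    # Variance of the difference of means: sigma^2 * (1/n_A + 1/(ratio * n_A)).
    return math.ceil((1 + 1 / groups_ratio) * (z_sum * std / delta) ** 2)


equal = group_a_size(100.0, 50.0, 1.1)
unequal = group_a_size(100.0, 50.0, 1.1, groups_ratio=0.1)
print(equal, unequal, round(unequal / equal, 2))
```

With groups_ratio=0.1 the multiplier becomes 11 instead of 2, so group A has to be about 5.5 times larger. The tables above show the same factor: for a 5% effect and errors (0.05; 0.2), 433219 / 78768 is approximately 5.5.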
Empirical design¶
Now we will change the design method to empiric and calculate group sizes by conducting many pseudo A/B tests on historical data.
By default, the Designer uses the two-sample independent t-test as the statistical criterion.
To limit the computational cost we will set the bs_samples parameter to a low value. This parameter determines how many pseudo A/B tests are conducted to evaluate one value of the designed parameter; higher values (at least 1000) give more accurate estimates.
We will also use multiprocessing to speed up the calculations and set the value of n_jobs to 4 (by default it is equal to 1).
[13]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             bs_samples=100,
             n_jobs=4)
[13]:
| Errors ($\alpha$, $\beta$) | (0.01, 0.1) | (0.01, 0.2) | (0.05, 0.1) | (0.05, 0.2) |
|---|---|---|---|---|
| Effect | ||||
| 5.0% | 153569 | 137126 | 117300 | 73706 |
| 10.0% | 41096 | 34920 | 27711 | 21503 |
| 20.0% | 10299 | 8827 | 7639 | 5822 |
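The noise caused by a small bs_samples value can be quantified roughly. Assuming each pseudo A/B test is an independent trial that rejects the null hypothesis with probability equal to the true power, the power estimate is a binomial proportion, and its standard error is given by the sketch below (the function name is hypothetical).

```python
import math


def power_estimate_se(power: float, bs_samples: int) -> float:
    """Standard error of a power estimate built from bs_samples pseudo A/B tests."""
    return math.sqrt(power * (1 - power) / bs_samples)


for bs_samples in (100, 1000, 10000):
    print(bs_samples, round(power_estimate_se(0.8, bs_samples), 4))
```

Under this assumption, 100 pseudo tests estimate a power of 80% with a standard error of about 4 percentage points, which helps explain why the empirical group sizes above differ visibly from the theoretical ones.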
The statistical criterion can be changed using the criterion parameter.
[14]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             criterion='mw',
             bs_samples=100,
             n_jobs=4)
[14]:
| Errors ($\alpha$, $\beta$) | (0.01, 0.1) | (0.01, 0.2) | (0.05, 0.1) | (0.05, 0.2) |
|---|---|---|---|---|
| Effect | ||||
| 5.0% | 66810 | 58589 | 46579 | 39088 |
| 10.0% | 15249 | 12426 | 10748 | 10069 |
| 20.0% | 4247 | 3891 | 2979 | 2340 |
We can also use the bootstrap criterion, for example to calculate the power.
[25]:
designer.run(to_design='power',
             method='empiric',
             effects=effects,
             sizes=sizes,
             criterion='bootstrap',
             bs_samples=1000,
             n_jobs=4)
[25]:
| $\alpha$ | Effect | Group sizes (1000, 1000) | Group sizes (3000, 3000) | Group sizes (7000, 7000) |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.2% | 3.9% | 5.1% |
| 0.01 | 10.0% | 4.5% | 7.8% | 19.7% |
| 0.01 | 20.0% | 9.9% | 31.5% | 70.8% |
| 0.05 | 5.0% | 7.6% | 9.8% | 13.9% |
| 0.05 | 10.0% | 10.6% | 19.7% | 37.5% |
| 0.05 | 20.0% | 24.4% | 57.1% | 84.7% |
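To give an idea of what a bootstrap criterion does, here is a sketch of one common variant, the percentile bootstrap confidence interval for the difference of group means. It is an illustration on synthetic data with hypothetical names; the bootstrap criterion implemented in Ambrosia may differ in its details.

```python
import random


def bootstrap_diff_ci(group_a, group_b, alpha=0.05, n_resamples=1000, seed=0):
    """Percentile bootstrap CI for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement and record the difference of means.
        res_a = rng.choices(group_a, k=len(group_a))
        res_b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(res_b) / len(res_b) - sum(res_a) / len(res_a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


data_rng = random.Random(1)
group_a = [data_rng.gauss(100, 20) for _ in range(300)]  # synthetic control group
group_b = [data_rng.gauss(110, 20) for _ in range(300)]  # synthetic test group
low, high = bootstrap_diff_ci(group_a, group_b)
print(low, high, "reject H0" if low > 0 or high < 0 else "keep H0")
```

The null hypothesis of no difference is rejected when the interval does not contain zero. Since every test involves many resamples, this criterion multiplies the cost of the empirical design.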
Ambrosia implements a number of criteria, but keep in mind that each of them has its own prerequisites and tests its own null hypothesis.
The alternative and groups_ratio parameters are also available in the empirical approach.
[27]:
designer.run(to_design='power',
             method='empiric',
             sizes=sizes,
             effects=effects,
             criterion='ttest',
             bs_samples=10000,
             alternative='greater',
             groups_ratio=2.0,
             n_jobs=4)
[27]:
| $\alpha$ | Effect | Group sizes (1000, 2000) | Group sizes (3000, 6000) | Group sizes (7000, 14000) |
|---|---|---|---|---|
| 0.01 | 5.0% | 3.6% | 5.9% | 10.7% |
| 0.01 | 10.0% | 7.9% | 16.4% | 34.5% |
| 0.01 | 20.0% | 21.9% | 55.4% | 88.0% |
| 0.05 | 5.0% | 12.5% | 17.6% | 25.9% |
| 0.05 | 10.0% | 20.5% | 37.1% | 58.8% |
| 0.05 | 20.0% | 44.4% | 76.5% | 96.4% |
Note: The empirical approach consumes a significant amount of computing resources and memory, especially when calculations are made on large groups.
Stand-alone design function¶
There is also a stand-alone function, design(), that replicates the behavior of the Designer and can be used in the same way to calculate A/B test parameters.
Let’s design test parameters for two metrics. The output is a dict of pandas tables, one per metric.
[28]:
design_result = design(to_design='power',
                       dataframe=data,
                       metrics=['sum_dur', 'vod_cnt'],
                       method='theory',
                       first_type_errors=first_type_errors,
                       sizes=sizes,
                       effects=effects)
Theoretical design of power for sum_dur metric
[29]:
design_result['sum_dur']
[29]:
| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |
Theoretical design of power for vod_cnt metric
[30]:
design_result['vod_cnt']
[30]:
| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 2.3% | 5.6% | 14.4% |
| 0.01 | 10.0% | 7.6% | 27.5% | 67.2% |
| 0.01 | 20.0% | 38.5% | 91.6% | 100.0% |
| 0.05 | 5.0% | 8.8% | 16.7% | 32.7% |
| 0.05 | 10.0% | 20.8% | 50.7% | 85.6% |
| 0.05 | 20.0% | 62.7% | 97.7% | 100.0% |
Storable configuration¶
A Designer class instance can be saved to and created from a YAML config file. Attributes such as datasets are not serialized and must be set after the instance is loaded.
Let’s create an instance with the preferred attributes
[31]:
store_path = '_examples_configs/designer_config.yaml'
[32]:
storable_designer = Designer(effects=[1.05, 1.1, 1.2],
                             sizes=[1000, 3000, 7000],
                             first_type_errors=[0.01, 0.05],
                             metrics=['sum_dur', 'ln_vod_cnt'])
[33]:
storable_designer.__getstate__()
[33]:
{'effects': [1.05, 1.1, 1.2],
'sizes': [1000, 3000, 7000],
'first_type_errors': [0.01, 0.05],
'second_type_errors': [0.2],
'metrics': ['sum_dur', 'ln_vod_cnt'],
'method': 'theory'}
Save the config to a file
[34]:
with open(store_path, 'w') as outfile:
    yaml.dump(storable_designer, outfile, default_flow_style=True)
Load the instance from the file and set the data
[35]:
loaded_designer = load_from_config(store_path)
loaded_designer.set_dataframe(data)
[36]:
loaded_designer.__getstate__()
[36]:
{'effects': [1.05, 1.1, 1.2],
'sizes': [1000, 3000, 7000],
'first_type_errors': [0.01, 0.05],
'second_type_errors': [0.2],
'metrics': ['sum_dur', 'ln_vod_cnt'],
'method': 'theory'}
Design an experiment parameter with the loaded instance
[37]:
design_results = loaded_designer.run('power')
[38]:
design_results['sum_dur']
[38]:
| $\alpha$ | Effect | Group size 1000 | Group size 3000 | Group size 7000 |
|---|---|---|---|---|
| 0.01 | 5.0% | 1.4% | 2.2% | 4.1% |
| 0.01 | 10.0% | 2.7% | 6.9% | 18.3% |
| 0.01 | 20.0% | 9.4% | 34.8% | 77.8% |
| 0.05 | 5.0% | 6.1% | 8.5% | 13.3% |
| 0.05 | 10.0% | 9.7% | 19.4% | 38.6% |
| 0.05 | 20.0% | 24.3% | 59.0% | 91.6% |
Learn more¶
There are a few more examples of designing experiment parameters with Ambrosia.
Check:
Designer class documentation
An example of binary metrics experiment design
An example of designing parameters using Spark DataFrame (currently has limited functionality)