Example of using the Designer class to calculate A/B test parameters

This tutorial reviews Ambrosia’s experiment design tools using the example of calculating the parameters of a hypothetical A/B test. It uses synthetic data on MTS KION user metrics.

Before we look at the tools, here is a short list of questions and answers covering some essentials of experiment design.

Note: This tutorial assumes fixed-horizon experiments. In this kind of experiment, decisions are made based on the results obtained at the end of the planned test duration.

What is needed before designing A/B test parameters?

Before the experiment it is good to have:

  • Formulated and fixed hypothesis

  • One or a set of metrics that meet all the requirements of the task, and on which the conclusion will be drawn

  • A fixed plan of the decision-making process based on the results of the experiment

Also, to calculate the test parameters themselves, we need historical data on the selected metrics.

What parameters does a typical A/B test have?

In a typical A/B test we use a statistical criterion that tests our hypothesis, and the experimental setup has four related parameters:

  • Type I error (alpha) - the probability that the criterion falsely reports an effect when there are no real changes; alpha is the significance level, and 1 - alpha is called the confidence level

  • Type II error (beta) - the probability that the criterion falsely misses an effect when real changes are present; 1 - beta is called statistical power

  • Group sizes - the number of objects in each experimental group, converted to the duration of the experiment using traffic

  • Minimal detectable effect (MDE) - the smallest true effect of a change that can be detected with a given statistical power at a given significance level

Note: For tests with multiple groups or several metrics, this must be taken into account appropriately when calculating the parameters of the experiment.
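One common and conservative way to account for several comparisons is the Bonferroni correction, which divides the overall type I error by the number of comparisons. The sketch below is only an illustration: the function name is hypothetical, and this is not necessarily how Ambrosia handles multiple groups or metrics.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison type I error that keeps the family-wise error <= alpha."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Two test groups compared with one control on two metrics: 4 comparisons.
per_test_alpha = bonferroni_alpha(0.05, n_comparisons=4)
print(per_test_alpha)  # 0.0125
```

The stricter per-comparison alpha would then be used as the type I error when designing each individual comparison.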

Why does one need to calculate A/B test parameters?

This is necessary to obtain correct and expected results from the experiment.
Nobody wants to run an experiment longer than necessary or get results with low statistical power.

Typically, researchers fix the type I error at some level (the industry default is 0.05) and try to maximize the statistical power of the test under the existing limitations of the business environment.

These limitations usually include:

  • Test duration limitation due to the risk of a negative impact from the implemented change

  • Test duration limitation because of test costs

  • Group size limitation due to the limits of the available object pool or traffic channel

  • MDE limitation due to its minimal reasonable size

  • MDE limitation due to the weak impact of the change on the tested metric

  • MDE limitation due to the development costs of the implemented change

  • Costs of type I and type II errors, which limit or fix these values

For example, there may be such a statement of the design problem:
What is the minimal detectable effect on the metric that we can detect with given errors of type I and II, if the size of our groups is fixed?
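Under the normal approximation, this particular question has a closed-form answer. The sketch below uses the textbook formula for a two-sided test with two equal groups. The function name and the metric's mean and standard deviation are hypothetical, and Ambrosia's own implementation may differ in details.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(mean, std, group_size, alpha=0.05, beta=0.2):
    """Relative MDE of a two-sided test with two equal groups (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # Standard error of the difference in means is std * sqrt(2 / n).
    absolute_mde = (z_alpha + z_beta) * std * sqrt(2 / group_size)
    return absolute_mde / mean


# Hypothetical metric with mean 100 and standard deviation 50.
mde = relative_mde(mean=100.0, std=50.0, group_size=1000)
print(f"{mde:.1%}")
```

Lower type I or type II errors increase the quantiles and therefore the MDE, while larger groups shrink it.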

How can the parameters be calculated?

Ambrosia offers two approaches to calculate experiment parameters using metric historical data.

Theoretical approach

The first method is based on the analytical formula for the difference of normally distributed quantities. This method is very fast because it only requires the mean and variance of the empirical distribution of the metric, and it is recommended as a first step.

Don’t worry if your metric isn’t normally distributed: for a large enough group, the CLT will work for you. However, to obtain fully correct results, it is necessary to check that the corresponding confidence intervals achieve their nominal coverage.

You can read more about this theoretical formula here or in other sources.
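A coverage check of the kind mentioned above can be sketched as a small simulation: draw many samples from a skewed distribution, build a normal-approximation confidence interval for the mean each time, and count how often it contains the true mean. The log-normal metric and the function name are assumptions chosen for illustration.

```python
import random
from math import exp, sqrt
from statistics import NormalDist, fmean


def ci_coverage(sample_size, n_trials=1000, level=0.95, seed=0):
    """Share of normal-approximation intervals that contain the true mean."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # hypothetical log-normal metric (heavily skewed)
    true_mean = exp(mu + sigma ** 2 / 2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.lognormvariate(mu, sigma) for _ in range(sample_size)]
        mean = fmean(sample)
        variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
        half_width = z * sqrt(variance / sample_size)
        if abs(mean - true_mean) <= half_width:
            hits += 1
    return hits / n_trials


coverage = {n: ci_coverage(n) for n in (20, 1000)}
print(coverage)
```

For small samples the empirical coverage typically falls noticeably below the nominal 95%, and it approaches the nominal level as the group size grows.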

Empirical approach

The theoretical approach is fast and convenient, but does not take into account your specific criteria and all the features of the distribution of metric values.

The empirical method calculates parameters by repeatedly sampling groups from the passed historical data, modeling the effect on the test group, and applying the selected statistical test to a large number of group pairs. Thus, the statistical power can be estimated empirically and the other parameters matched to it.

This method is more computationally expensive and can give noisy parameter estimates when the number of sampled groups is small.
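The procedure can be sketched in a few lines of plain Python. This is a simplified illustration rather than Ambrosia's implementation: the function names, the exponential stand-in for historical data, sampling without replacement, and the z-test used as the criterion are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a, mean_b = fmean(a), fmean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_b - mean_a) / sqrt(var_a / len(a) + var_b / len(b))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def empirical_power(history, group_size, effect, alpha=0.05, n_pairs=500, seed=0):
    """Share of simulated A/B pairs in which the injected effect is detected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_pairs):
        pair = rng.sample(history, 2 * group_size)  # two disjoint groups
        group_a = pair[:group_size]
        group_b = [x * effect for x in pair[group_size:]]  # multiplicative effect
        if z_test_pvalue(group_a, group_b) < alpha:
            rejections += 1
    return rejections / n_pairs


# Hypothetical skewed metric standing in for real historical data.
data_rng = random.Random(42)
history = [data_rng.expovariate(1.0) for _ in range(20_000)]
print(empirical_power(history, group_size=1000, effect=1.2))
```

Calling the same function with effect=1.0 estimates the empirical type I error, which is a useful sanity check of the chosen criterion.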

Note: The empirical approach is not suitable for binary metrics. Instead, you can choose the binary method, which solves the inverse problem by constructing a large number of binary confidence intervals.
This method has its own features; see the separate example on designing experiments with binary metrics.

Now, let’s start the tutorial.

[2]:
import numpy as np
import pandas as pd

import yaml

from ambrosia.designer import Designer, design, load_from_config

Load data

[3]:
data = pd.read_csv('../tests/test_data/kion_data.csv', sep=';')
[4]:
data.head()
[4]:
   profile_id    sum_dur   vod_cnt  ln_vod_cnt  bin_col
0  99402893794   20104282  83       5.533356    1
1  878511937265  3986136   53       4.807294    1
2  998929369788  2063965   22       3.187069    1
3  265028786131  523539    14       2.679252    1
4  995182338752  1588224   19       4.177776    1

The Designer class is Ambrosia’s main tool for calculating experiment parameters. It has one main public method, run(), which returns a table with the calculated test parameters.

Let’s create an instance of the class and pass the constructor a dataframe with historical data on the metric we will design for. In our case, this is sum_dur, the total duration of content viewing per user.

[5]:
designer = Designer(dataframe=data, metrics='sum_dur')

In fact, we can pass the dataframe and metrics later as arguments to the run() method. The same applies to most parameters directly related to the experiment (errors, effects, and so on): pass them to the constructor during initialization, in which case they become attributes of the created instance, or pass them later when calling run(). If a parameter is given in both places, the argument passed to the method takes precedence over the attribute value.

Theoretical design

Now we will calculate the parameters of the experiment using the theoretical approach and a grid of the other known parameters.

[6]:
# Set the parameter grid
effects = [1.05, 1.1, 1.2]  # Relative effects as multipliers: 1.05 means +5%
sizes = [1000, 3000, 7000]  # Size of each group
first_type_errors = [0.01, 0.05]
second_type_errors = [0.1, 0.2]

Calculate MDE

[7]:
designer.run(to_design='effect',
             method='theory',
             first_type_errors=first_type_errors,
             second_type_errors=second_type_errors,
             sizes=sizes)
[7]:
Errors ($\alpha$, $\beta$)  (0.01; 0.1)  (0.01; 0.2)  (0.05; 0.1)  (0.05; 0.2)
Group sizes
1000                        61.1%        54.2%        51.4%        44.4%
3000                        35.3%        31.3%        29.6%        25.6%
7000                        23.1%        20.5%        19.4%        16.8%
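A quick way to sanity-check these numbers is the inverse square-root law: under the normal approximation, the MDE shrinks in proportion to 1 / sqrt(group size). The sketch below takes the (0.05; 0.2) column from the output above and predicts the other rows from the first one.

```python
from math import sqrt

# MDE values (in %) copied from the (0.05; 0.2) column of the output above.
mde_by_size = {1000: 44.4, 3000: 25.6, 7000: 16.8}

base_size, base_mde = 1000, mde_by_size[1000]
for size, reported in mde_by_size.items():
    predicted = base_mde * sqrt(base_size / size)
    print(f"{size}: predicted {predicted:.1f}%, reported {reported}%")
```

The predicted and reported values agree to within rounding, so halving the MDE requires roughly four times as many objects per group.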

We will use these error rates further on, so let’s set them using setters.

[8]:
designer.set_first_errors(first_type_errors)
designer.set_second_errors(second_type_errors)

Now calculate group sizes

[9]:
designer.run(to_design='size', method='theory', effects=effects)
[9]:
Errors ($\alpha$, $\beta$)  (0.01; 0.1)  (0.01; 0.2)  (0.05; 0.1)  (0.05; 0.2)
Effect
5.0%                        149323       117206       105448       78768
10.0%                       37332        29303        26363        19693
20.0%                       9335         7327         6592         4924
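As noted earlier, group sizes are converted to a test duration using traffic. A minimal sketch of that conversion follows; the function name and the daily traffic figure are made up for illustration, and a constant inflow of eligible users is assumed.

```python
from math import ceil


def duration_days(group_size, n_groups, daily_new_users):
    """Days needed to fill all groups at a constant inflow of eligible users."""
    return ceil(group_size * n_groups / daily_new_users)


# 19693 users per group: 10% effect at alpha=0.05, beta=0.2 (output above).
# The daily traffic of 3000 eligible users is a made-up figure.
print(duration_days(group_size=19693, n_groups=2, daily_new_users=3000))  # 14
```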

Finally calculate statistical power

[10]:
designer.run(to_design='power', method='theory', effects=effects, sizes=sizes)
[10]:
Group sizes         1000    3000    7000
$\alpha$  Effect
0.01      5.0%      1.4%    2.2%    4.1%
          10.0%     2.7%    6.9%    18.3%
          20.0%     9.4%    34.8%   77.8%
0.05      5.0%      6.1%    8.5%    13.3%
          10.0%     9.7%    19.4%   38.6%
          20.0%     24.3%   59.0%   91.6%

We can change the alternative hypothesis. By default it is "two-sided"; now we want to test only for positive changes.

[11]:
designer.run(to_design='power',
             method='theory',
             effects=effects,
             sizes=sizes,
             alternative='greater')
[11]:
Group sizes         1000    3000    7000
$\alpha$  Effect
0.01      5.0%      2.2%    3.8%    6.8%
          10.0%     4.5%    10.9%   25.6%
          20.0%     14.4%   44.4%   84.5%
0.05      5.0%      9.2%    13.6%   20.9%
          10.0%     15.5%   29.1%   51.0%
          20.0%     35.1%   70.6%   95.5%
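The gain in power from a one-sided alternative can be reproduced with the normal approximation. In the sketch below the function name is hypothetical, and the coefficient of variation cv=3.54 is an assumption back-derived from the group-size output earlier in this tutorial rather than read from the data.

```python
from math import sqrt
from statistics import NormalDist


def theoretical_power(rel_effect, cv, group_size, alpha=0.05, alternative="two-sided"):
    """Power of a z-test for a relative mean shift between two equal groups.

    cv is the coefficient of variation (std / mean) of the metric.
    """
    norm = NormalDist()
    shift = rel_effect / (cv * sqrt(2 / group_size))  # effect in standard errors
    if alternative == "greater":
        return norm.cdf(shift - norm.inv_cdf(1 - alpha))
    if alternative == "two-sided":
        z_crit = norm.inv_cdf(1 - alpha / 2)
        return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
    raise ValueError(f"unsupported alternative: {alternative}")


# cv=3.54 is an assumed value (see the note above), not a property of the dataset.
two_sided = theoretical_power(0.20, cv=3.54, group_size=7000)
greater = theoretical_power(0.20, cv=3.54, group_size=7000, alternative="greater")
print(f"two-sided: {two_sided:.1%}, greater: {greater:.1%}")
```

With these assumptions the result is close to the 91.6% and 95.5% reported for a 20% effect, 7000 objects per group, and alpha = 0.05.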

The groups_ratio parameter makes the group sizes unequal. The size of group B equals the size of group A multiplied by groups_ratio. By default, it is 1.0.

Let’s calculate the required size for a group A : group B proportion of 10 : 1. The resulting table shows the size of group A.

[12]:
designer.run(to_design='size',
             method='theory',
             effects=effects,
             groups_ratio=0.1)
[12]:
Errors ($\alpha$, $\beta$)  (0.01; 0.1)  (0.01; 0.2)  (0.05; 0.1)  (0.05; 0.2)
Effect
5.0%                        821269       644622       579958       433219
10.0%                       205320       161158       144991       108306
20.0%                       51333        40292        36249        27078
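Under the normal approximation, unequal groups change only the variance factor of the difference in means, from 2 to 1 + 1 / groups_ratio. For groups_ratio = 0.1 this gives 11 / 2 = 5.5 times the equal-groups size, which matches the ratio of the outputs above (433219 / 78768 ≈ 5.5). In the sketch below the function name is hypothetical and cv=3.54 is an assumed coefficient of variation, not a value read from the data.

```python
from math import ceil
from statistics import NormalDist


def group_a_size(rel_effect, cv, alpha=0.05, beta=0.2, groups_ratio=1.0):
    """Size of group A when group B holds groups_ratio * (size of A) objects."""
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(1 - beta)
    variance_factor = 1 + 1 / groups_ratio  # equals 2 for equal groups
    return ceil(z_sum ** 2 * cv ** 2 * variance_factor / rel_effect ** 2)


equal = group_a_size(0.05, cv=3.54)  # cv is an assumed value
unequal = group_a_size(0.05, cv=3.54, groups_ratio=0.1)
print(equal, unequal, round(unequal / equal, 2))
```

The total number of objects, size of A plus size of B, is smallest for equal groups, so an unbalanced split is usually a cost that is accepted for business reasons.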

Empirical design

Now we will switch the design method to the empirical one (method='empiric') and calculate group sizes by conducting many pseudo A/B tests on historical data.

As a default statistical criterion, the Designer uses the two-sample independent T-test.

To limit computational cost, we will set the bs_samples parameter to a low value. This parameter determines how many pseudo A/B tests are conducted to evaluate one value of the designed parameter. Higher values (at least 1000) give more accurate parameter estimates.
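How much noise to expect can be estimated with the binomial standard error: if each pseudo test independently detects the effect with probability equal to the true power, the estimated power has a standard error of sqrt(power * (1 - power) / bs_samples). The sketch below relies on that independence assumption, and its function name is hypothetical.

```python
from math import sqrt


def power_standard_error(power, bs_samples):
    """Standard error of a power estimate from bs_samples independent pseudo tests."""
    return sqrt(power * (1 - power) / bs_samples)


for bs_samples in (100, 1000, 10000):
    se = power_standard_error(0.8, bs_samples)
    print(f"bs_samples={bs_samples}: power 80% +/- {1.96 * se:.1%}")
```

With 100 pseudo tests, a true power of 80% is estimated only to within about 8 percentage points, which explains the irregularities in low-precision runs.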

We will also use multiprocessing to speed up calculations and set the value of n_jobs to 4 (by default it is equal to 1).

[13]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             bs_samples=100,
             n_jobs=4)
[13]:
Errors ($\alpha$, $\beta$)  (0.01, 0.1)  (0.01, 0.2)  (0.05, 0.1)  (0.05, 0.2)
Effect
5.0%                        153569       137126       117300       73706
10.0%                       41096        34920        27711        21503
20.0%                       10299        8827         7639         5822

The statistical criterion can be changed using the criterion parameter.

[14]:
designer.run(to_design='size',
             method='empiric',
             effects=effects,
             criterion='mw',
             bs_samples=100,
             n_jobs=4)
[14]:
Errors ($\alpha$, $\beta$)  (0.01, 0.1)  (0.01, 0.2)  (0.05, 0.1)  (0.05, 0.2)
Effect
5.0%                        66810        58589        46579        39088
10.0%                       15249        12426        10748        10069
20.0%                       4247         3891         2979         2340

We can also use the bootstrap criterion to calculate a parameter.

[25]:
designer.run(to_design='power',
             method='empiric',
             effects=effects,
             sizes=sizes,
             criterion='bootstrap',
             bs_samples=1000,
             n_jobs=4)
[25]:
Group sizes         (1000, 1000)  (3000, 3000)  (7000, 7000)
$\alpha$  Effect
0.01      5.0%      2.2%          3.9%          5.1%
          10.0%     4.5%          7.8%          19.7%
          20.0%     9.9%          31.5%         70.8%
0.05      5.0%      7.6%          9.8%          13.9%
          10.0%     10.6%         19.7%         37.5%
          20.0%     24.4%         57.1%         84.7%
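For readers unfamiliar with bootstrap criteria, the sketch below shows one common variant: a percentile bootstrap interval for the difference in means, with the null hypothesis rejected when the interval excludes zero. It is a generic illustration with hypothetical names and synthetic data. Ambrosia's bootstrap criterion may be implemented differently.

```python
import random
from statistics import fmean


def bootstrap_rejects(group_a, group_b, alpha=0.05, n_resamples=1000, seed=0):
    """Reject H0 (equal means) if the percentile bootstrap interval excludes 0."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(group_a, k=len(group_a))  # with replacement
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(fmean(resample_b) - fmean(resample_a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return not (lower <= 0 <= upper)


# Synthetic groups: the treated group is shifted by half a standard deviation.
data_rng = random.Random(1)
control = [data_rng.gauss(10, 2) for _ in range(300)]
treated = [data_rng.gauss(11, 2) for _ in range(300)]
print(bootstrap_rejects(control, treated))
```

Each such test needs many resamples, so running it inside thousands of pseudo A/B tests multiplies the cost, which is why the bootstrap criterion is the slowest option in an empirical design.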

A number of criteria are implemented in Ambrosia, but remember that each of them has its own prerequisites and tests its own null hypothesis.

The alternative and groups_ratio parameters are also available in the empirical approach.

[27]:
designer.run(to_design='power',
             method='empiric',
             sizes=sizes,
             effects=effects,
             criterion='ttest',
             bs_samples=10000,
             alternative='greater',
             groups_ratio=2.0,
             n_jobs=4)
[27]:
Group sizes         (1000, 2000)  (3000, 6000)  (7000, 14000)
$\alpha$  Effect
0.01      5.0%      3.6%          5.9%          10.7%
          10.0%     7.9%          16.4%         34.5%
          20.0%     21.9%         55.4%         88.0%
0.05      5.0%      12.5%         17.6%         25.9%
          10.0%     20.5%         37.1%         58.8%
          20.0%     44.4%         76.5%         96.4%

Note: The empirical approach consumes a significant amount of computing resources and memory, especially when calculations are made on large groups.

Stand-alone design function

There is also a stand-alone design() function that replicates the behavior of the Designer and can be used in the same way to calculate A/B test parameters.

Let’s design test parameters for two metrics. The output is a dict of pandas tables, one per metric.

[28]:
design_result = design(to_design='power',
                       dataframe=data,
                       metrics=['sum_dur', 'vod_cnt'],
                       method='theory',
                       first_type_errors=first_type_errors,
                       sizes=sizes,
                       effects=effects)

Theoretical design of power for sum_dur metric

[29]:
design_result['sum_dur']
[29]:
Group sizes         1000    3000    7000
$\alpha$  Effect
0.01      5.0%      1.4%    2.2%    4.1%
          10.0%     2.7%    6.9%    18.3%
          20.0%     9.4%    34.8%   77.8%
0.05      5.0%      6.1%    8.5%    13.3%
          10.0%     9.7%    19.4%   38.6%
          20.0%     24.3%   59.0%   91.6%

Theoretical design of power for vod_cnt metric

[30]:
design_result['vod_cnt']
[30]:
Group sizes         1000    3000    7000
$\alpha$  Effect
0.01      5.0%      2.3%    5.6%    14.4%
          10.0%     7.6%    27.5%   67.2%
          20.0%     38.5%   91.6%   100.0%
0.05      5.0%      8.8%    16.7%   32.7%
          10.0%     20.8%   50.7%   85.6%
          20.0%     62.7%   97.7%   100.0%

Storable configuration

A Designer instance can be saved to and created from a YAML config file. Attributes such as datasets are not serialized and must be set after the instance is loaded.

Let’s create an instance with the preferred attributes.

[31]:
store_path = '_examples_configs/designer_config.yaml'
[32]:
storable_designer = Designer(effects=[1.05, 1.1, 1.2],
                             sizes=[1000, 3000, 7000],
                             first_type_errors=[0.01, 0.05],
                             metrics=['sum_dur', 'ln_vod_cnt'])
[33]:
storable_designer.__getstate__()
[33]:
{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}

Save the config to a file

[34]:
with open(store_path, 'w') as outfile:
    yaml.dump(storable_designer, outfile, default_flow_style=True)

Load the instance from the file and set the data

[35]:
loaded_designer = load_from_config(store_path)
loaded_designer.set_dataframe(data)
[36]:
loaded_designer.__getstate__()
[36]:
{'effects': [1.05, 1.1, 1.2],
 'sizes': [1000, 3000, 7000],
 'first_type_errors': [0.01, 0.05],
 'second_type_errors': [0.2],
 'metrics': ['sum_dur', 'ln_vod_cnt'],
 'method': 'theory'}

Design an experiment parameter with the loaded instance

[37]:
design_results = loaded_designer.run('power')
[38]:
design_results['sum_dur']
[38]:
Group sizes         1000    3000    7000
$\alpha$  Effect
0.01      5.0%      1.4%    2.2%    4.1%
          10.0%     2.7%    6.9%    18.3%
          20.0%     9.4%    34.8%   77.8%
0.05      5.0%      6.1%    8.5%    13.3%
          10.0%     9.7%    19.4%   38.6%
          20.0%     24.3%   59.0%   91.6%

Learn more

There are a few more examples of designing experiment parameters with Ambrosia.

Check:

  • Designer class documentation

  • An example of binary metrics experiment design

  • An example of designing parameters using Spark DataFrame (currently has limited functionality)

  • Habr post about Ambrosia