Example of using the Tester class to evaluate experiment effects

This tutorial shows how Ambrosia's testing tools can be used to statistically evaluate the effects observed in experiments.

Usually, by the time we run a statistical evaluation, the statistical criterion and the type I error threshold have already been chosen at the experiment design stage.

After the experiment, the experimenters compare the obtained p-value with the type I error threshold and compute a point estimate of the effect together with its confidence interval.
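As a rough illustration of this decision rule, the sketch below computes a p-value, a point estimate and a confidence interval for a difference in means using a large-sample normal approximation. It relies only on the Python standard library; the function name `evaluate_effect` and the synthetic data are assumptions made for this example and are not part of Ambrosia.

```python
import random
from statistics import NormalDist, mean, variance


def evaluate_effect(group_a, group_b, alpha=0.05):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    effect = mean(group_b) - mean(group_a)
    # Standard error of the difference between two independent sample means.
    std_error = (variance(group_a) / len(group_a)
                 + variance(group_b) / len(group_b)) ** 0.5
    z_stat = effect / std_error
    pvalue = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    # Confidence interval at level 1 - alpha around the point estimate.
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * std_error
    return {
        "pvalue": pvalue,
        "effect": effect,
        "confidence_interval": (effect - half_width, effect + half_width),
        # The null hypothesis is rejected when the p-value is below the threshold.
        "reject_null": pvalue < alpha,
    }


# Synthetic data: the treatment group is shifted upwards on purpose.
rng = random.Random(0)
control = [rng.gauss(100, 10) for _ in range(500)]
treatment = [rng.gauss(150, 10) for _ in range(500)]

result = evaluate_effect(control, treatment, alpha=0.01)
print(result)
```

The threshold `alpha` plays the role of the type I error level fixed at the design stage; only the comparison `pvalue < alpha` depends on it, while the point estimate does not.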

Below, we walk through all of these steps using the Tester class on synthetically generated data.

Let’s start the tutorial

[2]:
import pandas as pd
import numpy as np

from ambrosia.tester import Tester

Load data

[3]:
data = pd.read_csv('../tests/test_data/watch_result_agg.csv')

The data contains users' content views aggregated over the experiment, with the users divided into two groups.

[4]:
data.head()
[4]:
    id  watched     group
0   6   597.833362  A
1   11  549.314234  A
2   20  564.401942  A
3   21  248.735358  A
4   23  926.048946  B

Everything needed for effect estimation is inside the Tester class. It has one main public method, run(), which returns a table with the p-value, the point estimate of the effect and the confidence interval.

The Tester class is Ambrosia’s main tool for evaluating the effect of an experiment.

Let’s create an instance of the class, passing the constructor the experimental data, the name of the group column, and the arguments defined at the design stage.

[5]:
tester = Tester(dataframe=data,
                column_groups='group',
                metrics='watched',
                first_type_errors=0.01)

Now we call the run() method to estimate the absolute uplift using the t-test criterion.

[6]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='ttest')
[6]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.01              0.000022  55.314679  (14.578, 96.0514)    watched      A              B

We can also estimate the relative effect.

[7]:
tester.run(effect_type='relative',
           method='theory',
           criterion='ttest')
[7]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.01              0.00004   0.079901   (0.0299, 0.1303)     watched      A              B
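To make the difference between the two effect types concrete, here is a minimal sketch. It assumes that the relative effect is the difference in group means divided by the mean of group A; the exact definition and the confidence interval construction used by the library may differ. The helper names and the numbers are illustrative only.

```python
def absolute_effect(mean_a, mean_b):
    """Difference between the group means, in the units of the metric."""
    return mean_b - mean_a


def relative_effect(mean_a, mean_b):
    """Difference between the group means as a fraction of the group A mean."""
    return (mean_b - mean_a) / mean_a


# Illustrative group means, not taken from the tutorial data.
mean_a, mean_b = 200.0, 216.0
print(absolute_effect(mean_a, mean_b))
print(relative_effect(mean_a, mean_b))
```

The absolute effect is expressed in the units of the metric, while the relative effect is dimensionless, which makes it easier to compare across metrics of different scale.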

Change alternative from "two-sided" to "greater"

[8]:
tester.run(
    effect_type='relative',
    method='theory',
    criterion='ttest',
    alternative='greater',
)
[8]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.01              0.00002   0.079901   (0.0347, inf)        watched      A              B
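The output above shows two typical consequences of a one-sided alternative: the p-value is about half of the two-sided one, and the confidence interval has a single finite bound. The sketch below reproduces both effects with a normal approximation from the standard library. The helper names are assumptions for this example, and the library's t-test will give slightly different numbers.

```python
from statistics import NormalDist


def p_values(z_stat):
    """Return (two-sided, greater) p-values for a z statistic."""
    norm = NormalDist()
    two_sided = 2 * (1 - norm.cdf(abs(z_stat)))
    greater = 1 - norm.cdf(z_stat)
    return two_sided, greater


def bound_quantile(alpha, alternative):
    """Quantile that scales the standard error when building the interval."""
    # A one-sided interval spends the whole alpha on a single tail.
    level = 1 - alpha if alternative == "greater" else 1 - alpha / 2
    return NormalDist().inv_cdf(level)


two_sided, greater = p_values(2.5)
print(two_sided, greater)
print(bound_quantile(0.01, "two-sided"), bound_quantile(0.01, "greater"))
```

Because the one-sided quantile is smaller, the lower bound moves closer to the point estimate, which agrees with the change from 0.0299 to 0.0347 in the outputs above.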

Change criterion to Mann–Whitney test

[9]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='mw',
           metrics='watched',
           first_type_errors=0.01)
[9]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.01              0.000035  43.598116  (None, None)         watched      A              B
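For readers unfamiliar with the Mann–Whitney criterion, the following sketch shows the idea behind it: the test statistic counts how often a value from one group exceeds a value from the other, so it depends on ranks rather than on means. This is a simplified standard-library version with a normal approximation and no tie correction; the function names are assumptions, and it is not the implementation used by Ambrosia.

```python
from statistics import NormalDist


def mann_whitney_u(group_a, group_b):
    """U statistic for group_b: pairs with b > a count 1, ties count 0.5."""
    u_stat = 0.0
    for a in group_a:
        for b in group_b:
            if b > a:
                u_stat += 1.0
            elif b == a:
                u_stat += 0.5
    return u_stat


def mann_whitney_pvalue(group_a, group_b):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    u_stat = mann_whitney_u(group_a, group_b)
    mean_u = n_a * n_b / 2
    std_u = (n_a * n_b * (n_a + n_b + 1) / 12) ** 0.5
    z_stat = (u_stat - mean_u) / std_u
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
print(mann_whitney_pvalue([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

The pairwise loop is quadratic in the sample size and is meant only to expose the definition; rank-based formulas are preferable for real data.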

Use bootstrap criteria by changing method to "empiric"

[10]:
tester.run(effect_type='absolute',
           method='empiric',
           metrics='watched',
           first_type_errors=0.01)
[10]:
    first_type_error  pvalue        effect     confidence_interval  metric name  group A label  group B label
0   0.01              3.552714e-15  55.314679  (21.2797, 88.1704)   watched      A              B
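The general idea of a bootstrap interval can be sketched as follows: resample each group with replacement many times, recompute the difference in means, and take quantiles of the resulting distribution. The code below is a plain percentile bootstrap written with the standard library. The function name, the number of resamples and the synthetic data are assumptions, and the bootstrap variant used by Ambrosia may differ.

```python
import random
from statistics import mean


def bootstrap_ci(group_a, group_b, alpha=0.05, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic data with a true shift of 10 between the groups.
rng = random.Random(1)
control = [rng.gauss(100, 20) for _ in range(200)]
treatment = [rng.gauss(110, 20) for _ in range(200)]

observed = mean(treatment) - mean(control)
lower, upper = bootstrap_ci(control, treatment, alpha=0.01)
print(observed, (lower, upper))
```

No distributional formula is needed here, which is why an empirical method is useful when the theoretical criteria do not fit the metric.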

To evaluate binary metrics, such as conversion, change the method to "binary".

Multiple hypothesis correction

Tester can apply a multiple hypothesis correction (MHC) to p-values and confidence intervals. The total number of hypotheses equals the number of group combinations multiplied by the number of metrics passed.

By default, the Bonferroni correction is applied; it can be turned off by passing None as the correction_method argument.

Let’s create several synthetic groups and look at the Tester behavior.

[11]:
total_size = 1000
groups = ['A', 'B', 'C', 'D']
[12]:
np.random.seed(42)
multi_groups_result = pd.DataFrame(np.random.normal(size=(total_size, 2)),
                                   columns=['metric_1', 'metric_2'])
multi_groups_result['groups'] = np.random.choice(groups, size=total_size)
multi_groups_result = multi_groups_result.sort_values('groups')
[13]:
multi_tester = Tester(dataframe=multi_groups_result,
                      column_groups='groups',
                      metrics=['metric_1', 'metric_2'])

Here we have 6 unique group pairs to test and two metrics, that is, 12 hypotheses in total. With the Bonferroni correction, the p-values are multiplied by 12 (and capped at 1), and the confidence intervals are widened accordingly.

[14]:
multi_tester.run(method='theory')
[14]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.05              1.0       -0.084442  (-0.3213, 0.1524)    metric_1     A              B
1   0.05              1.0       -0.102428  (-0.3644, 0.1595)    metric_2     A              B
2   0.05              1.0       0.028641   (-0.2191, 0.2764)    metric_1     A              C
3   0.05              1.0       -0.142255  (-0.4022, 0.1176)    metric_2     A              C
4   0.05              1.0       0.050312   (-0.1946, 0.2952)    metric_1     A              D
5   0.05              1.0       -0.063565  (-0.3157, 0.1885)    metric_2     A              D
6   0.05              1.0       0.113082   (-0.1351, 0.3613)    metric_1     B              C
7   0.05              1.0       -0.039827  (-0.3085, 0.2289)    metric_2     B              C
8   0.05              1.0       0.134753   (-0.1107, 0.3802)    metric_1     B              D
9   0.05              1.0       0.038863   (-0.2223, 0.3)       metric_2     B              D
10  0.05              1.0       0.021671   (-0.2342, 0.2776)    metric_1     C              D
11  0.05              1.0       0.078690   (-0.1804, 0.3378)    metric_2     C              D
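The twelve rows above correspond to every pair of groups combined with every metric. A minimal sketch of how this count and a Bonferroni-adjusted p-value can be obtained is shown below; the helper `bonferroni_pvalue` and the example p-values are illustrative assumptions and not part of the Ambrosia API.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]
metrics = ["metric_1", "metric_2"]

# Every unordered pair of groups is compared on every metric.
group_pairs = list(combinations(groups, 2))
n_hypotheses = len(group_pairs) * len(metrics)


def bonferroni_pvalue(pvalue, n_hypotheses):
    """Multiply the p-value by the number of hypotheses, capped at 1."""
    return min(1.0, pvalue * n_hypotheses)


print(group_pairs)
print(n_hypotheses)
print(bonferroni_pvalue(0.2, n_hypotheses))    # capped at 1.0
print(bonferroni_pvalue(0.001, n_hypotheses))  # still well below 1
```

A moderate p-value multiplied by 12 exceeds 1 and is capped, which is why a whole column of adjusted p-values can be equal to 1.0 when no strong effect is present.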

Now let’s turn off the MHC

[15]:
multi_tester.run(method='theory', correction_method=None)
[15]:
    first_type_error  pvalue    effect     confidence_interval  metric name  group A label  group B label
0   0.05              0.307529  -0.084442  (-0.2465, 0.0776)    metric_1     A              B
1   0.05              0.263036  -0.102428  (-0.2816, 0.0767)    metric_2     A              B
2   0.05              0.740575  0.028641   (-0.1408, 0.1981)    metric_1     A              C
3   0.05              0.117435  -0.142255  (-0.32, 0.0355)      metric_2     A              C
4   0.05              0.556405  0.050312   (-0.1172, 0.2179)    metric_1     A              D
5   0.05              0.470332  -0.063565  (-0.236, 0.1089)     metric_2     A              D
6   0.05              0.192379  0.113082   (-0.0567, 0.2829)    metric_1     B              C
7   0.05              0.671231  -0.039827  (-0.2236, 0.144)     metric_2     B              C
8   0.05              0.116301  0.134753   (-0.0331, 0.3026)    metric_1     B              D
9   0.05              0.670008  0.038863   (-0.1398, 0.2175)    metric_2     B              D
10  0.05              0.808385  0.021671   (-0.1534, 0.1967)    metric_1     C              D
11  0.05              0.384652  0.078690   (-0.0986, 0.2559)    metric_2     C              D
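Comparing the two tables shows how much the correction widens the confidence intervals. Under a normal approximation, a Bonferroni-corrected interval uses the threshold alpha divided by the number of hypotheses, so its width grows by a ratio of two quantiles. The sketch below computes that ratio and compares it with the widths in the first rows of the two tables above. The function name is an assumption, and the library's t-based intervals will deviate slightly from the normal approximation.

```python
from statistics import NormalDist


def ci_widening_factor(alpha, n_hypotheses):
    """Ratio of corrected to uncorrected two-sided normal quantiles."""
    norm = NormalDist()
    uncorrected = norm.inv_cdf(1 - alpha / 2)
    corrected = norm.inv_cdf(1 - alpha / (2 * n_hypotheses))
    return corrected / uncorrected


factor = ci_widening_factor(alpha=0.05, n_hypotheses=12)

# Interval widths for metric_1, groups A and B, taken from the two tables.
corrected_width = 0.1524 - (-0.3213)
uncorrected_width = 0.0776 - (-0.2465)
observed_ratio = corrected_width / uncorrected_width

print(factor, observed_ratio)
```

The two ratios agree closely, which is consistent with the corrected intervals being built at the level 0.05 / 12 rather than 0.05.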

Learn more

More information on evaluating experiment effects with Ambrosia is available.

Check:

  • Tester class documentation

  • An example of making statistical inference and effect estimation on Spark data