Example of using the Tester class to evaluate the effect
This tutorial shows how Ambrosia's testing tools can be used to statistically evaluate effects in experiments.
Usually, when we make a statistical evaluation, the statistical criterion and the type I error threshold have already been selected at the experiment design stage.
After the experiment, the experimenters compare the obtained p-value with the type I error threshold and compute a point estimate of the effect together with its confidence interval.
Below we go through all these steps using the Tester class on synthetically generated data.
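The evaluation steps described above (p-value against the threshold, point estimate, confidence interval) can be sketched without Ambrosia. This is a minimal illustration that assumes a normal approximation for the difference in means; the function name `evaluate_effect` and the generated data are hypothetical and are not part of the Ambrosia API.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def evaluate_effect(group_a, group_b, alpha=0.05):
    """Absolute effect of B over A with a two-sided z-test (normal approximation)."""
    effect = fmean(group_b) - fmean(group_a)
    # Standard error of the difference in means, allowing unequal variances.
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    z = effect / se
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {
        "pvalue": pvalue,
        "effect": effect,
        "confidence_interval": (effect - half_width, effect + half_width),
        # Decision rule: reject the null hypothesis when p-value < alpha.
        "reject_null": pvalue < alpha,
    }


# Hypothetical synthetic data: group B has a true uplift of 5 units.
rng = random.Random(0)
group_a = [rng.gauss(100, 15) for _ in range(2000)]
group_b = [rng.gauss(105, 15) for _ in range(2000)]

result = evaluate_effect(group_a, group_b, alpha=0.01)
print(result)
```

The returned dictionary mirrors the kind of information the tutorial reads from the Tester output: a p-value, a point estimate and an interval at the chosen error level.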
Let’s start the tutorial
[2]:
import pandas as pd
import numpy as np
from ambrosia.tester import Tester
Load data
[3]:
data = pd.read_csv('../tests/test_data/watch_result_agg.csv')
The data contains users' content views aggregated over the experiment, and the users are split into two groups.
[4]:
data.head()
[4]:
| | id | watched | group |
|---|---|---|---|
| 0 | 6 | 597.833362 | A |
| 1 | 11 | 549.314234 | A |
| 2 | 20 | 564.401942 | A |
| 3 | 21 | 248.735358 | A |
| 4 | 23 | 926.048946 | B |
Everything needed for effect estimation is inside the Tester class. It has one main public method, run(), which returns a table with the p-value, the point estimate of the effect and the confidence interval.
The Tester class is Ambrosia’s main tool for the statistical evaluation of experiment results.
Let’s create an instance of the class, passing to the constructor the experimental data, the name of the group column, and the arguments defined at the design stage.
[5]:
tester = Tester(dataframe=data,
                column_groups='group',
                metrics='watched',
                first_type_errors=0.01)
Now we call the run() method to estimate the absolute uplift using the t-test criterion.
[6]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='ttest')
[6]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.000022 | 55.314679 | (14.578, 96.0514) | watched | A | B |
We can also estimate the relative effect.
[7]:
tester.run(effect_type='relative',
           method='theory',
           criterion='ttest')
[7]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.00004 | 0.079901 | (0.0299, 0.1303) | watched | A | B |
Change alternative from "two-sided" to "greater"
[8]:
tester.run(effect_type='relative',
           method='theory',
           criterion='ttest',
           alternative='greater')
[8]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.00002 | 0.079901 | (0.0347, inf) | watched | A | B |
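The relation between the two-sided and the "greater" results above can be checked with a rough normal-approximation sketch. The standard error below is back-computed from the two-sided interval in the earlier table, which is an assumption made only for illustration; Ambrosia's own computation may differ in detail.

```python
from statistics import NormalDist

norm = NormalDist()
alpha = 0.01
effect = 0.079901                # relative effect reported in the tables above
two_sided_ci = (0.0299, 0.1303)  # two-sided interval reported above

# Assumption: recover an approximate standard error from the two-sided interval.
se = (two_sided_ci[1] - two_sided_ci[0]) / 2 / norm.inv_cdf(1 - alpha / 2)
z = effect / se

# Two-sided test looks at both tails, "greater" only at the upper tail.
p_two_sided = 2 * (1 - norm.cdf(abs(z)))
p_greater = 1 - norm.cdf(z)

# One-sided interval: all of alpha goes into the lower bound, the upper bound is open.
lower_greater = effect - norm.inv_cdf(1 - alpha) * se
ci_greater = (lower_greater, float("inf"))

print(p_two_sided, p_greater, ci_greater)
```

For a positive effect the one-sided p-value is half of the two-sided one, and the lower bound moves up because the whole error budget is spent on one side. Under these assumptions the sketch lands close to the values shown in the tables.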
Change criterion to Mann–Whitney test
[9]:
tester.run(effect_type='absolute',
           method='theory',
           criterion='mw',
           metrics='watched',
           first_type_errors=0.01)
[9]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.000035 | 43.598116 | (None, None) | watched | A | B |
Use the bootstrap criterion by changing method to "empiric"
[10]:
tester.run(effect_type='absolute',
           method='empiric',
           metrics='watched',
           first_type_errors=0.01)
[10]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 3.552714e-15 | 55.314679 | (21.2797, 88.1704) | watched | A | B |
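To illustrate what a bootstrap estimate of the effect looks like, here is a minimal percentile-bootstrap sketch for the absolute difference in means. It is a generic illustration on hypothetical data; the function `bootstrap_effect` is not part of Ambrosia, and the library's "empiric" method may use a different bootstrap scheme.

```python
import random
from statistics import fmean


def bootstrap_effect(group_a, group_b, alpha=0.05, n_boot=2000, seed=0):
    """Point estimate and percentile bootstrap interval for mean(B) - mean(A)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sizes.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(fmean(resample_b) - fmean(resample_a))
    diffs.sort()
    # Percentile interval: cut alpha/2 of the resampled effects from each tail.
    lower = diffs[int(n_boot * alpha / 2)]
    upper = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return fmean(group_b) - fmean(group_a), (lower, upper)


# Hypothetical synthetic data with a true uplift of 10 units.
rng = random.Random(1)
group_a = [rng.gauss(100, 15) for _ in range(300)]
group_b = [rng.gauss(110, 15) for _ in range(300)]

effect, (lower, upper) = bootstrap_effect(group_a, group_b, alpha=0.01)
print(effect, (lower, upper))
```

No distributional formula is used here: the interval comes directly from the spread of the resampled effects.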
To evaluate binary metrics, such as conversion, change method to "binary".
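For intuition about testing binary metrics, a classic two-proportion z-test can be sketched as follows. This is a generic textbook illustration with hypothetical conversion counts; it is not taken from Ambrosia, whose "binary" method may rely on different criteria and interval types.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Two-sided z-test and Wald interval for the difference of two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    effect = p_b - p_a
    # The test statistic uses the pooled proportion under the null hypothesis.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = effect / se_pooled
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
    # The interval uses the unpooled standard error.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return pvalue, effect, (effect - half_width, effect + half_width)


# Hypothetical counts: 200 of 2000 conversions in A, 260 of 2000 in B.
pvalue, effect, ci = two_proportion_ztest(200, 2000, 260, 2000, alpha=0.01)
print(pvalue, effect, ci)
```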
Multiple hypothesis correction
Tester can apply multiple hypothesis correction (MHC) to p-values and confidence intervals. The total number of hypotheses equals the number of group pairs multiplied by the number of metrics passed.
By default the Bonferroni correction is applied; it can be turned off by passing None as the correction_method argument.
Let’s create several synthetic groups and look at the Tester behavior.
[11]:
total_size = 1000
groups = ['A', 'B', 'C', 'D']
[12]:
np.random.seed(42)
multi_groups_result = pd.DataFrame(np.random.normal(size=(total_size, 2)),
                                   columns=['metric_1', 'metric_2'])
multi_groups_result['groups'] = np.random.choice(groups, size=total_size)
multi_groups_result = multi_groups_result.sort_values('groups')
[13]:
multi_tester = Tester(dataframe=multi_groups_result,
                      column_groups='groups',
                      metrics=['metric_1', 'metric_2'])
Here we have 6 unique pairs of groups and two metrics, that is, 12 hypotheses. With the Bonferroni correction the p-values are multiplied by 12 (and capped at 1), and the confidence intervals are widened accordingly.
[14]:
multi_tester.run(method='theory')
[14]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 1.0 | -0.084442 | (-0.3213, 0.1524) | metric_1 | A | B |
| 1 | 0.05 | 1.0 | -0.102428 | (-0.3644, 0.1595) | metric_2 | A | B |
| 2 | 0.05 | 1.0 | 0.028641 | (-0.2191, 0.2764) | metric_1 | A | C |
| 3 | 0.05 | 1.0 | -0.142255 | (-0.4022, 0.1176) | metric_2 | A | C |
| 4 | 0.05 | 1.0 | 0.050312 | (-0.1946, 0.2952) | metric_1 | A | D |
| 5 | 0.05 | 1.0 | -0.063565 | (-0.3157, 0.1885) | metric_2 | A | D |
| 6 | 0.05 | 1.0 | 0.113082 | (-0.1351, 0.3613) | metric_1 | B | C |
| 7 | 0.05 | 1.0 | -0.039827 | (-0.3085, 0.2289) | metric_2 | B | C |
| 8 | 0.05 | 1.0 | 0.134753 | (-0.1107, 0.3802) | metric_1 | B | D |
| 9 | 0.05 | 1.0 | 0.038863 | (-0.2223, 0.3) | metric_2 | B | D |
| 10 | 0.05 | 1.0 | 0.021671 | (-0.2342, 0.2776) | metric_1 | C | D |
| 11 | 0.05 | 1.0 | 0.078690 | (-0.1804, 0.3378) | metric_2 | C | D |
When we disable the MHC:
[15]:
multi_tester.run(method='theory', correction_method=None)
[15]:
| | first_type_error | pvalue | effect | confidence_interval | metric name | group A label | group B label |
|---|---|---|---|---|---|---|---|
| 0 | 0.05 | 0.307529 | -0.084442 | (-0.2465, 0.0776) | metric_1 | A | B |
| 1 | 0.05 | 0.263036 | -0.102428 | (-0.2816, 0.0767) | metric_2 | A | B |
| 2 | 0.05 | 0.740575 | 0.028641 | (-0.1408, 0.1981) | metric_1 | A | C |
| 3 | 0.05 | 0.117435 | -0.142255 | (-0.32, 0.0355) | metric_2 | A | C |
| 4 | 0.05 | 0.556405 | 0.050312 | (-0.1172, 0.2179) | metric_1 | A | D |
| 5 | 0.05 | 0.470332 | -0.063565 | (-0.236, 0.1089) | metric_2 | A | D |
| 6 | 0.05 | 0.192379 | 0.113082 | (-0.0567, 0.2829) | metric_1 | B | C |
| 7 | 0.05 | 0.671231 | -0.039827 | (-0.2236, 0.144) | metric_2 | B | C |
| 8 | 0.05 | 0.116301 | 0.134753 | (-0.0331, 0.3026) | metric_1 | B | D |
| 9 | 0.05 | 0.670008 | 0.038863 | (-0.1398, 0.2175) | metric_2 | B | D |
| 10 | 0.05 | 0.808385 | 0.021671 | (-0.1534, 0.1967) | metric_1 | C | D |
| 11 | 0.05 | 0.384652 | 0.078690 | (-0.0986, 0.2559) | metric_2 | C | D |
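The link between the corrected and uncorrected tables can be reproduced by hand. The sketch below applies the Bonferroni rule to the uncorrected p-values listed above and estimates how much the intervals widen, assuming a normal approximation for the critical values (the library itself may use t-quantiles).

```python
from itertools import combinations
from statistics import NormalDist

groups = ["A", "B", "C", "D"]
metrics = ["metric_1", "metric_2"]

# Number of hypotheses: unordered group pairs times number of metrics.
n_hypotheses = len(list(combinations(groups, 2))) * len(metrics)

# Uncorrected p-values copied from the table above.
raw_pvalues = [0.307529, 0.263036, 0.740575, 0.117435, 0.556405, 0.470332,
               0.192379, 0.671231, 0.116301, 0.670008, 0.808385, 0.384652]

# Bonferroni: multiply each p-value by the number of hypotheses, cap at 1.
adjusted = [min(1.0, p * n_hypotheses) for p in raw_pvalues]

# Intervals are built at level alpha / n_hypotheses instead of alpha.
alpha = 0.05
z_raw = NormalDist().inv_cdf(1 - alpha / 2)
z_corrected = NormalDist().inv_cdf(1 - alpha / (2 * n_hypotheses))
widening = z_corrected / z_raw

print(n_hypotheses, adjusted, round(widening, 2))
```

Every uncorrected p-value here exceeds 1/12, so all adjusted values are capped at 1, which matches the corrected table. The ratio of critical values is also in line with how much wider the corrected intervals are.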
Learn more
More information on evaluating the effect of experiments with Ambrosia is available in:
Tester class documentation
An example of making statistical inference and effect estimation on Spark data