Multiple testing corrections in Ambrosia’s Tester
When an experiment is evaluated on several metrics (or several groups) at once, the p-values must be corrected for multiple comparisons. Since version 0.5.2, Tester supports eight correction methods. This notebook shows the problem and how to use each one.
1. The multiple-comparisons problem
A single test at the 5% level has a 5% chance of a false positive. Test many metrics and the chance that at least one looks “significant” by pure luck grows fast. A multiple-testing correction keeps that chance under control.
We simulate an A/B test with six metrics: five have no real effect (metric_1..metric_5) and one has a real positive effect (metric_6).
[1]:
import numpy as np
import pandas as pd
from ambrosia.tester import Tester
N = 2000 # users per group
rng = np.random.default_rng(8)
data = {"group": ["A"] * N + ["B"] * N}
for i in range(1, 6):  # metric_1..metric_5: no effect (A and B from the same distribution)
    data[f"metric_{i}"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.0, 1.0, N)]
data["metric_6"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.15, 1.0, N)] # real effect in B
df = pd.DataFrame(data)
metrics = [f"metric_{i}" for i in range(1, 7)]
df.head()
[1]:
| | group | metric_1 | metric_2 | metric_3 | metric_4 | metric_5 | metric_6 |
|---|---|---|---|---|---|---|---|
| 0 | A | -1.738266 | -0.006715 | -0.886940 | 1.199254 | -1.841961 | -0.164049 |
| 1 | A | -1.336643 | -0.244478 | 0.374934 | 1.521730 | -0.789614 | -0.582462 |
| 2 | A | -1.361107 | -0.471546 | -0.497588 | 0.121381 | 0.035406 | -1.076962 |
| 3 | A | -0.351617 | 0.692823 | 0.434765 | -0.011001 | -2.566139 | 0.218516 |
| 4 | A | -2.312582 | -0.573991 | 1.012139 | 0.385061 | 0.266645 | -0.536790 |
2. Without any correction
Run the Tester with correction_method=None. Watch the null metrics: by chance one slips below the 0.05 threshold - a false positive.
[2]:
tester = Tester(dataframe=df, column_groups="group", metrics=metrics)
raw = tester.run("absolute", method="theory", correction_method=None, as_table=True)
raw_view = raw[["metric name", "pvalue"]].copy()
raw_view["significant @ 0.05"] = raw_view["pvalue"] < 0.05
raw_view.round(4)
[2]:
| | metric name | pvalue | significant @ 0.05 |
|---|---|---|---|
| 0 | metric_1 | 0.1387 | False |
| 1 | metric_2 | 0.4573 | False |
| 2 | metric_3 | 0.8861 | False |
| 3 | metric_4 | 0.0320 | True |
| 4 | metric_5 | 0.6910 | False |
| 5 | metric_6 | 0.0000 | True |
Here metric_4 is flagged as significant even though it has no real effect, while metric_6 (the true effect) is significant too. With five independent null metrics, each tested at the 5% level, there was a 1 - 0.95^5 ≈ 23% chance of at least one such false positive.
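The arithmetic behind that figure can be sketched in a few lines. This is a minimal illustration that assumes the null metrics are independent; the helper name `familywise_error_rate` is made up for this example and is not part of Ambrosia.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly the chance of a spurious "win" grows with the number of null metrics.
for m in (1, 5, 10, 20):
    print(f"{m:2d} null metrics -> {familywise_error_rate(0.05, m):.1%}")
```

With five null metrics the result is about 22.6%, and it passes 60% by twenty metrics.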
3. Turning on a correction
Pass a method name to correction_method. "bonferroni" is the default, so existing code is unaffected. The full list of supported methods:
[3]:
from ambrosia.tools import multitest
multitest.available_methods()
[3]:
['bonferroni', 'sidak', 'holm', 'holm-sidak', 'fdr_bh', 'fdr_by', 'hommel', 'simes-hochberg']
4. Comparing the methods
The table puts the raw p-values next to every corrected version. No method makes a p-value smaller: each correction either leaves it unchanged or raises it (more conservative), and the methods differ in how much.
[4]:
methods = ["bonferroni", "holm", "holm-sidak", "sidak", "fdr_bh", "fdr_by", "hommel", "simes-hochberg"]
comparison = pd.DataFrame({"metric": raw["metric name"].values, "raw": raw["pvalue"].values})
for m in methods:
    res = tester.run("absolute", method="theory", correction_method=m, as_table=True)
    comparison[m] = res["pvalue"].values
comparison.round(4)
[4]:
| | metric | raw | bonferroni | holm | holm-sidak | sidak | fdr_bh | fdr_by | hommel | simes-hochberg |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | metric_1 | 0.1387 | 0.8319 | 0.5546 | 0.4495 | 0.5916 | 0.2773 | 0.6794 | 0.5546 | 0.5546 |
| 1 | metric_2 | 0.4573 | 1.0000 | 1.0000 | 0.8401 | 0.9744 | 0.6859 | 1.0000 | 0.8861 | 0.8861 |
| 2 | metric_3 | 0.8861 | 1.0000 | 1.0000 | 0.9045 | 1.0000 | 0.8861 | 1.0000 | 0.8861 | 0.8861 |
| 3 | metric_4 | 0.0320 | 0.1922 | 0.1602 | 0.1502 | 0.1774 | 0.0961 | 0.2354 | 0.1602 | 0.1602 |
| 4 | metric_5 | 0.6910 | 1.0000 | 1.0000 | 0.9045 | 0.9991 | 0.8293 | 1.0000 | 0.8861 | 0.8861 |
| 5 | metric_6 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
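To make three of these columns less of a black box, here is a sketch of the textbook definitions of the Bonferroni, Holm and Benjamini-Hochberg adjusted p-values in plain Python. It is not Ambrosia's implementation, and the function names are chosen for this example. The input is the rounded `raw` column from the table, so the results should agree with the table only to about three decimals.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Step-down: the k-th smallest p-value (k = 0, 1, ...) is multiplied by m - k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Step-up: the k-th smallest p-value (k = 1, 2, ...) is multiplied by m / k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        candidate = min(1.0, pvals[idx] * m / (rank + 1))
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


# Rounded raw p-values for metric_1..metric_6, copied from the table above.
raw_pvalues = [0.1387, 0.4573, 0.8861, 0.0320, 0.6910, 0.0000]

for name, func in [("bonferroni", bonferroni), ("holm", holm), ("fdr_bh", benjamini_hochberg)]:
    print(name, [round(p, 4) for p in func(raw_pvalues)])
```

In all three cases metric_4 ends up above 0.05 while metric_6 stays at zero, which is the pattern visible in the table.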
5. Which method should I use?
- Family-wise error rate (FWER) - bonferroni, holm, holm-sidak, sidak, hommel, simes-hochberg. They bound the probability of making any false positive. bonferroni is the simplest but most conservative; holm and the other step-wise methods reject at least as much, so you keep more power.
- False discovery rate (FDR) - fdr_bh (Benjamini-Hochberg) and fdr_by (Benjamini-Yekutieli). They control the expected share of false positives among the metrics you call significant - usually the right trade-off for a dashboard with many metrics.
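The claim that bonferroni is the most conservative can be checked against its closest relative, the Šidák correction. The sketch below uses the textbook single-step formulas, not Ambrosia's code, and feeds them the rounded raw p-values from the comparison table in section 4.

```python
def bonferroni(pvals):
    """Single-step Bonferroni: m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def sidak(pvals):
    """Single-step Sidak: 1 - (1 - p) ** m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


# Rounded raw p-values for metric_1..metric_6 from the comparison table.
raw_pvalues = [0.1387, 0.4573, 0.8861, 0.0320, 0.6910, 0.0000]

for p, b, s in zip(raw_pvalues, bonferroni(raw_pvalues), sidak(raw_pvalues)):
    print(f"raw={p:.4f}  bonferroni={b:.4f}  sidak={s:.4f}")
```

The Šidák value never exceeds the Bonferroni value, so any metric that bonferroni rejects is also rejected by sidak.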
How many metrics stay significant at 5% under each approach?
[5]:
alpha = 0.05
for label, cm in [("none", None), ("bonferroni", "bonferroni"), ("holm", "holm"), ("fdr_bh", "fdr_bh")]:
    res = tester.run("absolute", method="theory", correction_method=cm, as_table=True)
    n_sig = int((res["pvalue"] < alpha).sum())
    print(f"{label:11s}: {n_sig} significant metric(s) at {alpha}")
none : 2 significant metric(s) at 0.05
bonferroni : 1 significant metric(s) at 0.05
holm : 1 significant metric(s) at 0.05
fdr_bh : 1 significant metric(s) at 0.05
Without a correction, 2 metrics are reported as significant (one is the false positive). Each of the three corrections removes the false positive and still keeps the genuine metric_6.
6. What happens to confidence intervals?
For the constant-scaling methods (bonferroni, sidak) the confidence intervals are widened to stay consistent with the corrected decision. The step-wise methods (Holm, FDR, …) adjust only the p-values and leave the intervals at their nominal level.
[6]:
none_ci = tester.run("absolute", method="theory", correction_method=None, as_table=True)
bonf_ci = tester.run("absolute", method="theory", correction_method="bonferroni", as_table=True)
holm_ci = tester.run("absolute", method="theory", correction_method="holm", as_table=True)
pd.DataFrame({
    "metric": none_ci["metric name"].values,
    "CI (none)": list(none_ci["confidence_interval"]),
    "CI (bonferroni - wider)": list(bonf_ci["confidence_interval"]),
    "CI (holm - same as none)": list(holm_ci["confidence_interval"]),
})
[6]:
| | metric | CI (none) | CI (bonferroni - wider) | CI (holm - same as none) |
|---|---|---|---|---|
| 0 | metric_1 | (-0.0155, 0.1111) | (-0.0373, 0.133) | (-0.0155, 0.1111) |
| 1 | metric_2 | (-0.0838, 0.0377) | (-0.1048, 0.0587) | (-0.0838, 0.0377) |
| 2 | metric_3 | (-0.0664, 0.0573) | (-0.0878, 0.0787) | (-0.0664, 0.0573) |
| 3 | metric_4 | (-0.1285, -0.0058) | (-0.1497, 0.0154) | (-0.1285, -0.0058) |
| 4 | metric_5 | (-0.0486, 0.0733) | (-0.0697, 0.0944) | (-0.0486, 0.0733) |
| 5 | metric_6 | (0.1358, 0.2583) | (0.1146, 0.2795) | (0.1358, 0.2583) |
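The amount of widening in the bonferroni column is consistent with building each interval at level 1 - alpha/m instead of 1 - alpha. The sketch below checks this with a normal approximation. Both the approximation and the alpha/m reading are assumptions made for this example rather than a confirmed description of Ambrosia's internals, and `z_critical` is a hypothetical helper.

```python
from statistics import NormalDist


def z_critical(alpha: float, n_tests: int = 1) -> float:
    """Two-sided normal critical value at per-interval level alpha / n_tests."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


alpha, m = 0.05, 6

# Expected widening factor if each interval is built at level alpha / m.
expected_ratio = z_critical(alpha, m) / z_critical(alpha)

# Half-widths read off the metric_1 row of the table above.
none_half_width = (0.1111 - (-0.0155)) / 2
bonf_half_width = (0.133 - (-0.0373)) / 2
observed_ratio = bonf_half_width / none_half_width

print(f"expected widening: {expected_ratio:.3f}")
print(f"observed widening: {observed_ratio:.3f}")
```

Both ratios come out at roughly 1.35, so the six Bonferroni intervals are about a third wider than the uncorrected ones.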
7. Aliases and summary
Friendly aliases are accepted, e.g. "benjamini-hochberg" == "fdr_bh" and "benjamini-yekutieli" == "fdr_by".
[7]:
alias = tester.run("absolute", method="theory", correction_method="benjamini-hochberg", as_table=True)
canonical = tester.run("absolute", method="theory", correction_method="fdr_bh", as_table=True)
bool((alias["pvalue"].values == canonical["pvalue"].values).all())
[7]:
True
Rule of thumb
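| Situation | Suggested |
|---|---|
| One metric, two groups | correction_method=None (a single test needs no correction) |
| A few key metrics, must avoid any false positive | an FWER method, e.g. holm |
| Many metrics on a dashboard | an FDR method, e.g. fdr_bh |
| You also need corrected confidence intervals | bonferroni or sidak |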
Switching the correction is a one-argument change to Tester.run.