Multiple testing corrections in Ambrosia’s `Tester`

When an experiment is evaluated on several metrics (or several groups) at once, the p-values must be corrected for multiple comparisons. Since version 0.5.2, Tester supports eight correction methods. This notebook shows the problem and how to use each one.

1. The multiple-comparisons problem

A single test at the 5% level has a 5% chance of a false positive when there is no real effect. Test many metrics and the chance that at least one looks “significant” by pure luck grows fast: for m independent null metrics it is 1 - 0.95^m. A multiple-testing correction keeps that chance under control.
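To get a feel for how quickly the problem grows, here is a minimal sketch of that probability for a few family sizes. It assumes the tests are independent and every null hypothesis is true; the helper name `family_wise_error_rate` is made up for this illustration and is not part of Ambrosia.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent true nulls."""
    return 1 - (1 - alpha) ** n_tests


for m in (1, 5, 6, 20):
    rate = family_wise_error_rate(m)
    print(f"{m:2d} uncorrected tests -> P(at least one false positive) = {rate:.3f}")
```

With correlated metrics the exact number differs, but the direction is the same: more uncorrected tests mean more spurious "wins".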

We simulate an A/B test with six metrics: five have no real effect (metric_1..metric_5) and one has a real positive effect (metric_6).

[1]:

import numpy as np
import pandas as pd

from ambrosia.tester import Tester

N = 2000  # users per group
rng = np.random.default_rng(8)

data = {"group": ["A"] * N + ["B"] * N}
for i in range(1, 6):  # metric_1..metric_5: no effect (A and B from the same distribution)
    data[f"metric_{i}"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.0, 1.0, N)]
data["metric_6"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.15, 1.0, N)]  # real effect in B

df = pd.DataFrame(data)
metrics = [f"metric_{i}" for i in range(1, 7)]
df.head()

[1]:

	group	metric_1	metric_2	metric_3	metric_4	metric_5	metric_6
0	A	-1.738266	-0.006715	-0.886940	1.199254	-1.841961	-0.164049
1	A	-1.336643	-0.244478	0.374934	1.521730	-0.789614	-0.582462
2	A	-1.361107	-0.471546	-0.497588	0.121381	0.035406	-1.076962
3	A	-0.351617	0.692823	0.434765	-0.011001	-2.566139	0.218516
4	A	-2.312582	-0.573991	1.012139	0.385061	0.266645	-0.536790

2. Without any correction

Run the Tester with correction_method=None. Watch the null metrics: by chance one slips below the 0.05 threshold - a false positive.

[2]:

tester = Tester(dataframe=df, column_groups="group", metrics=metrics)

raw = tester.run("absolute", method="theory", correction_method=None, as_table=True)
raw_view = raw[["metric name", "pvalue"]].copy()
raw_view["significant @ 0.05"] = raw_view["pvalue"] < 0.05
raw_view.round(4)

[2]:

	metric name	pvalue	significant @ 0.05
0	metric_1	0.1387	False
1	metric_2	0.4573	False
2	metric_3	0.8861	False
3	metric_4	0.0320	True
4	metric_5	0.6910	False
5	metric_6	0.0000	True

Here metric_4 is flagged as significant even though it has no real effect, while metric_6 (the true effect) is significant too. With five null metrics there was a ~23% chance of at least one such false positive.
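The ~23% figure can be checked empirically. The sketch below is not an Ambrosia call: it repeats a five-metric A/A comparison many times with plain NumPy and counts how often at least one metric crosses the uncorrected 5% threshold. To stay dependency-free it uses a z-test with known unit variance instead of the t-test, and the simulation sizes (`n_sim`, `n_users`) are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim, n_metrics, n_users = 2000, 5, 200
z_crit = 1.959964  # two-sided 5% critical value of the standard normal

# Both groups are drawn from N(0, 1), so every metric is a true null.
a = rng.normal(0.0, 1.0, (n_sim, n_metrics, n_users))
b = rng.normal(0.0, 1.0, (n_sim, n_metrics, n_users))

# z-statistic for a difference in means when the variance is known to be 1.
z = (b.mean(axis=-1) - a.mean(axis=-1)) / np.sqrt(2.0 / n_users)

# An experiment "fails" if any of its five null metrics looks significant.
any_false_positive = (np.abs(z) > z_crit).any(axis=1)

simulated = any_false_positive.mean()
expected = 1 - 0.95 ** n_metrics
print(f"simulated: {simulated:.3f}, theoretical: {expected:.3f}")
```

The simulated share should land close to the theoretical value, up to Monte Carlo noise.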

3. Turning on a correction

Pass a method name to correction_method. "bonferroni" is the default, so existing code is unaffected. The full list of supported methods:

[3]:

from ambrosia.tools import multitest

multitest.available_methods()

[3]:

['bonferroni', 'sidak', 'holm', 'holm-sidak', 'fdr_bh', 'fdr_by', 'hommel', 'simes-hochberg']

4. Comparing the methods

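The table puts the raw p-values next to every corrected version. No method ever makes a p-value smaller, and most make it larger (more conservative); they differ in how much. The largest raw p-value can stay unchanged under some methods, as metric_3 does under fdr_bh, hommel and simes-hochberg.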

[4]:

methods = ["bonferroni", "holm", "holm-sidak", "sidak", "fdr_bh", "fdr_by", "hommel", "simes-hochberg"]

comparison = pd.DataFrame({"metric": raw["metric name"].values, "raw": raw["pvalue"].values})
for m in methods:
    res = tester.run("absolute", method="theory", correction_method=m, as_table=True)
    comparison[m] = res["pvalue"].values
comparison.round(4)

[4]:

	metric	raw	bonferroni	holm	holm-sidak	sidak	fdr_bh	fdr_by	hommel	simes-hochberg
0	metric_1	0.1387	0.8319	0.5546	0.4495	0.5916	0.2773	0.6794	0.5546	0.5546
1	metric_2	0.4573	1.0000	1.0000	0.8401	0.9744	0.6859	1.0000	0.8861	0.8861
2	metric_3	0.8861	1.0000	1.0000	0.9045	1.0000	0.8861	1.0000	0.8861	0.8861
3	metric_4	0.0320	0.1922	0.1602	0.1502	0.1774	0.0961	0.2354	0.1602	0.1602
4	metric_5	0.6910	1.0000	1.0000	0.9045	0.9991	0.8293	1.0000	0.8861	0.8861
5	metric_6	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
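To see where these numbers come from, the sketch below re-derives four of the columns from the textbook definitions of the adjusted p-values. It is an illustration, not Ambrosia's implementation: the function names are made up for this example, and the input is the rounded raw column from the table (metric_6 enters as 0.0), so the results agree with the table only to about three decimals.

```python
import numpy as np


def bonferroni(p):
    # Multiply every p-value by the number of tests, cap at 1.
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)


def sidak(p):
    # 1 - (1 - p)^m, exact for independent tests.
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p) ** p.size


def holm(p):
    # Step-down: the k-th smallest p-value is multiplied by (m - k + 1),
    # then a running maximum keeps the adjusted values monotone.
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * (m - np.arange(m))
    adjusted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out


def benjamini_hochberg(p):
    # Step-up: the k-th smallest p-value is multiplied by m / k,
    # then a running minimum from the largest p-value downwards.
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out


# Rounded raw p-values for metric_1..metric_6, copied from the table above.
raw_p = [0.1387, 0.4573, 0.8861, 0.0320, 0.6910, 0.0000]

for name, fn in [("bonferroni", bonferroni), ("sidak", sidak),
                 ("holm", holm), ("fdr_bh", benjamini_hochberg)]:
    print(f"{name:10s}", np.round(fn(raw_p), 4))
```

Writing the methods out this way also shows why Holm never produces a larger value than Bonferroni: its multipliers run from m down to 1 instead of staying at m.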

5. Which method should I use?

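Family-wise error rate (FWER) - bonferroni, holm, holm-sidak, sidak, hommel, simes-hochberg. They bound the probability of making any false positive. bonferroni is the simplest but most conservative; holm and the other step-wise methods reject at least as much, so you keep more power. bonferroni and holm are valid under any dependence between the metrics, whereas hommel and simes-hochberg assume independent or positively dependent tests.
False discovery rate (FDR) - fdr_bh (Benjamini-Hochberg) and fdr_by (Benjamini-Yekutieli). They control the expected share of false positives among the metrics you call significant - usually the right trade-off for a dashboard with many metrics. fdr_bh assumes independent or positively dependent tests; fdr_by holds under arbitrary dependence at the cost of being more conservative.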
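The difference in power is easiest to see on a family with several moderately small p-values. The sketch below counts rejections at the 5% level using the textbook decision rules; the p-values are invented for illustration and the function names are not part of Ambrosia.

```python
import numpy as np


def n_rejected_bonferroni(p, alpha=0.05):
    # Single-step: every p-value is compared with alpha / m.
    p = np.asarray(p, dtype=float)
    return int((p <= alpha / p.size).sum())


def n_rejected_holm(p, alpha=0.05):
    # Step-down: compare sorted p-values with alpha/m, alpha/(m-1), ...
    # and stop at the first one that fails.
    p = np.sort(np.asarray(p, dtype=float))
    m = p.size
    thresholds = alpha / (m - np.arange(m))
    failures = np.nonzero(p > thresholds)[0]
    return int(failures[0]) if failures.size else m


def n_rejected_bh(p, alpha=0.05):
    # Step-up: find the largest k with p_(k) <= k * alpha / m
    # and reject the k smallest p-values.
    p = np.sort(np.asarray(p, dtype=float))
    m = p.size
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(p <= thresholds)[0]
    return int(passing[-1]) + 1 if passing.size else 0


# Hypothetical p-values, chosen only to separate the three rules.
p_values = [0.001, 0.008, 0.012, 0.02, 0.03, 0.6]

print("bonferroni:", n_rejected_bonferroni(p_values))
print("holm      :", n_rejected_holm(p_values))
print("fdr_bh    :", n_rejected_bh(p_values))
```

On this made-up family the FDR rule rejects the most and Bonferroni the least, which is the usual ordering. The price of the extra rejections is a weaker guarantee: a bounded share of false discoveries rather than a bounded chance of any.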

How many metrics stay significant at 5% under each approach?

[5]:

alpha = 0.05
for label, cm in [("none", None), ("bonferroni", "bonferroni"), ("holm", "holm"), ("fdr_bh", "fdr_bh")]:
    res = tester.run("absolute", method="theory", correction_method=cm, as_table=True)
    n_sig = int((res["pvalue"] < alpha).sum())
    print(f"{label:11s}: {n_sig} significant metric(s) at {alpha}")

none       : 2 significant metric(s) at 0.05
bonferroni : 1 significant metric(s) at 0.05
holm       : 1 significant metric(s) at 0.05
fdr_bh     : 1 significant metric(s) at 0.05

none reports 2 (one is the false positive). Every correction removes the false positive, and all three (Bonferroni, Holm and Benjamini-Hochberg) still keep the genuine metric_6.

6. What happens to confidence intervals?

For the constant-scaling methods (bonferroni, sidak) the confidence intervals are widened to stay consistent with the corrected decision. The step-wise methods (Holm, FDR, …) adjust only the p-values and leave the intervals at their nominal level.

[6]:

none_ci = tester.run("absolute", method="theory", correction_method=None, as_table=True)
bonf_ci = tester.run("absolute", method="theory", correction_method="bonferroni", as_table=True)
holm_ci = tester.run("absolute", method="theory", correction_method="holm", as_table=True)

pd.DataFrame({
    "metric": none_ci["metric name"].values,
    "CI (none)": list(none_ci["confidence_interval"]),
    "CI (bonferroni - wider)": list(bonf_ci["confidence_interval"]),
    "CI (holm - same as none)": list(holm_ci["confidence_interval"]),
})

[6]:

	metric	CI (none)	CI (bonferroni - wider)	CI (holm - same as none)
0	metric_1	(-0.0155, 0.1111)	(-0.0373, 0.133)	(-0.0155, 0.1111)
1	metric_2	(-0.0838, 0.0377)	(-0.1048, 0.0587)	(-0.0838, 0.0377)
2	metric_3	(-0.0664, 0.0573)	(-0.0878, 0.0787)	(-0.0664, 0.0573)
3	metric_4	(-0.1285, -0.0058)	(-0.1497, 0.0154)	(-0.1285, -0.0058)
4	metric_5	(-0.0486, 0.0733)	(-0.0697, 0.0944)	(-0.0486, 0.0733)
5	metric_6	(0.1358, 0.2583)	(0.1146, 0.2795)	(0.1358, 0.2583)
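The amount of widening can be estimated with a back-of-the-envelope calculation. The sketch below assumes a normal-approximation interval whose significance level is replaced by alpha / m for Bonferroni and by 1 - (1 - alpha)^(1/m) for Sidak; that assumption is consistent with the table above but is not taken from Ambrosia's source, and the helper `z_critical` is made up for this example.

```python
from statistics import NormalDist


def z_critical(alpha):
    """Two-sided critical value of the standard normal at level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


alpha, m = 0.05, 6  # nominal level and number of metrics in this notebook

z_none = z_critical(alpha)
z_bonf = z_critical(alpha / m)                      # Bonferroni-adjusted level
z_sidak = z_critical(1 - (1 - alpha) ** (1 / m))    # Sidak-adjusted level

widening = z_bonf / z_none
print(f"z (none)       = {z_none:.3f}")
print(f"z (bonferroni) = {z_bonf:.3f}  -> interval {widening:.2f}x as wide")
print(f"z (sidak)      = {z_sidak:.3f}")
```

Under these assumptions the Bonferroni interval comes out roughly a third wider than the nominal one. That matches the table: the metric_1 interval grows from a half-width of about 0.063 to about 0.085.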

7. Aliases and summary

Friendly aliases are accepted, e.g. "benjamini-hochberg" == "fdr_bh" and "benjamini-yekutieli" == "fdr_by".

[7]:

alias = tester.run("absolute", method="theory", correction_method="benjamini-hochberg", as_table=True)
canonical = tester.run("absolute", method="theory", correction_method="fdr_bh", as_table=True)
bool((alias["pvalue"].values == canonical["pvalue"].values).all())

[7]:

True

Rule of thumb

Situation	Suggested `correction_method`
One metric, two groups	`None` (nothing to correct)
A few key metrics, must avoid any false positive	`"holm"` (or `"bonferroni"`)
Many metrics on a dashboard	`"fdr_bh"`
You also need corrected confidence intervals	`"bonferroni"` or `"sidak"`

Switching the correction is a one-argument change to Tester.run.
