Multiple testing corrections in Ambrosia’s Tester

When an experiment is evaluated on several metrics (or several groups) at once, the p-values must be corrected for multiple comparisons. Since version 0.5.2, Tester supports eight correction methods. This notebook shows the problem and how to use each one.

1. The multiple-comparisons problem

When there is no real effect, a single test at the 5% level has a 5% chance of a false positive. Test many metrics and the chance that at least one looks “significant” by pure luck grows fast. A multiple-testing correction keeps that chance under control.
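How fast that chance grows can be sketched with a short calculation. It assumes the tests are independent and that every null hypothesis is true, in which case the probability of at least one false positive among m tests is 1 - (1 - alpha)^m. The function name is illustrative and not part of Ambrosia.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n_tests independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The chance of a spurious "significant" result as more metrics are tested.
for m in (1, 5, 10, 20):
    print(f"{m:2d} tests: {familywise_error_rate(m):.3f}")
```

With five tests the value is roughly 0.23, which is the situation simulated in the rest of this notebook.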

We simulate an A/B test with six metrics: five have no real effect (metric_1..metric_5) and one has a real positive effect (metric_6).

[1]:
import numpy as np
import pandas as pd

from ambrosia.tester import Tester

N = 2000  # users per group
rng = np.random.default_rng(8)

data = {"group": ["A"] * N + ["B"] * N}
for i in range(1, 6):  # metric_1..metric_5: no effect (A and B from the same distribution)
    data[f"metric_{i}"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.0, 1.0, N)]
data["metric_6"] = np.r_[rng.normal(0.0, 1.0, N), rng.normal(0.15, 1.0, N)]  # real effect in B

df = pd.DataFrame(data)
metrics = [f"metric_{i}" for i in range(1, 7)]
df.head()
[1]:
group metric_1 metric_2 metric_3 metric_4 metric_5 metric_6
0 A -1.738266 -0.006715 -0.886940 1.199254 -1.841961 -0.164049
1 A -1.336643 -0.244478 0.374934 1.521730 -0.789614 -0.582462
2 A -1.361107 -0.471546 -0.497588 0.121381 0.035406 -1.076962
3 A -0.351617 0.692823 0.434765 -0.011001 -2.566139 0.218516
4 A -2.312582 -0.573991 1.012139 0.385061 0.266645 -0.536790

2. Without any correction

Run the Tester with correction_method=None. Watch the null metrics: by chance one slips below the 0.05 threshold - a false positive.

[2]:
tester = Tester(dataframe=df, column_groups="group", metrics=metrics)

raw = tester.run("absolute", method="theory", correction_method=None, as_table=True)
raw_view = raw[["metric name", "pvalue"]].copy()
raw_view["significant @ 0.05"] = raw_view["pvalue"] < 0.05
raw_view.round(4)
[2]:
metric name pvalue significant @ 0.05
0 metric_1 0.1387 False
1 metric_2 0.4573 False
2 metric_3 0.8861 False
3 metric_4 0.0320 True
4 metric_5 0.6910 False
5 metric_6 0.0000 True

Here metric_4 is flagged as significant even though it has no real effect; metric_6 (the true effect) is significant as well. With five independent null metrics each tested at 0.05, there was a ~23% chance (1 - 0.95^5) of at least one such false positive.
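The ~23% figure can be checked with a small simulation. This is a minimal sketch using only the standard library; it relies on the fact that the p-value of a continuous test is uniform on [0, 1] when the null hypothesis is true, and it assumes the five null metrics are independent (as they are in the simulated data). The function name, number of trials and seed are arbitrary choices for illustration.

```python
import random


def simulate_any_false_positive(n_null: int, alpha: float = 0.05,
                                n_trials: int = 20_000, seed: int = 0) -> float:
    """Estimate P(at least one p-value < alpha) when all n_null tests are true nulls."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Under a true null hypothesis, each p-value is a uniform draw on [0, 1].
        if any(rng.random() < alpha for _ in range(n_null)):
            hits += 1
    return hits / n_trials


estimate = simulate_any_false_positive(n_null=5)
exact = 1 - (1 - 0.05) ** 5
print(f"simulated: {estimate:.3f}, closed form: {exact:.3f}")
```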

3. Turning on a correction

Pass a method name to correction_method. "bonferroni" is the default, so existing code is unaffected. The full list of supported methods:

[3]:
from ambrosia.tools import multitest

multitest.available_methods()
[3]:
['bonferroni', 'sidak', 'holm', 'holm-sidak', 'fdr_bh', 'fdr_by', 'hommel', 'simes-hochberg']

4. Comparing the methods

The table puts the raw p-values next to every corrected version. No method makes a p-value smaller (a correction can only be more conservative); they differ in how much they inflate it.

[4]:
methods = ["bonferroni", "holm", "holm-sidak", "sidak", "fdr_bh", "fdr_by", "hommel", "simes-hochberg"]

comparison = pd.DataFrame({"metric": raw["metric name"].values, "raw": raw["pvalue"].values})
for m in methods:
    res = tester.run("absolute", method="theory", correction_method=m, as_table=True)
    comparison[m] = res["pvalue"].values
comparison.round(4)
[4]:
metric raw bonferroni holm holm-sidak sidak fdr_bh fdr_by hommel simes-hochberg
0 metric_1 0.1387 0.8319 0.5546 0.4495 0.5916 0.2773 0.6794 0.5546 0.5546
1 metric_2 0.4573 1.0000 1.0000 0.8401 0.9744 0.6859 1.0000 0.8861 0.8861
2 metric_3 0.8861 1.0000 1.0000 0.9045 1.0000 0.8861 1.0000 0.8861 0.8861
3 metric_4 0.0320 0.1922 0.1602 0.1502 0.1774 0.0961 0.2354 0.1602 0.1602
4 metric_5 0.6910 1.0000 1.0000 0.9045 0.9991 0.8293 1.0000 0.8861 0.8861
5 metric_6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
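To see where some of these columns come from, here is a minimal pure-Python sketch of the textbook Bonferroni, Sidak and Holm adjustments. It is not Ambrosia's implementation, and the function names are illustrative. The inputs are the raw p-values as displayed in the table; metric_6 is shown as 0.0000 after rounding, so the 1e-6 used below is a stand-in rather than the actual value.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1); rank is 0-based.
        candidate = min(1.0, pvalues[idx] * (m - rank))
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# metric_1..metric_6, rounded as in the table; the last entry is a placeholder.
raw = [0.1387, 0.4573, 0.8861, 0.0320, 0.6910, 1e-6]
for name, fn in [("bonferroni", bonferroni), ("sidak", sidak), ("holm", holm)]:
    print(f"{name:10s}", [round(p, 4) for p in fn(raw)])
```

Because the inputs are rounded, the results should land close to the corresponding columns above rather than match them digit for digit.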

5. Which method should I use?

  • Family-wise error rate (FWER) - bonferroni, holm, holm-sidak, sidak, hommel, simes-hochberg. They bound the probability of making at least one false positive. bonferroni is the simplest but the most conservative of the group; holm and the other step-wise methods reject at least as much, so you keep more power. hommel and simes-hochberg rely on the tests being independent or positively dependent.

  • False discovery rate (FDR) - fdr_bh (Benjamini-Hochberg) and fdr_by (Benjamini-Yekutieli). They control the expected share of false positives among the metrics you call significant, which is usually the right trade-off for a dashboard with many metrics. fdr_by holds under arbitrary dependence between tests and is therefore more conservative than fdr_bh.
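The FDR idea can be made concrete with a minimal sketch of the textbook Benjamini-Hochberg step-up adjustment. This is not Ambrosia's implementation, and the function name is illustrative. The inputs are the raw p-values as displayed in the comparison table; metric_6 is shown as 0.0000 after rounding, so 1e-6 is a stand-in rather than the actual value.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest p first
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # 1-based rank of this p-value in ascending order
        candidate = pvalues[idx] * m / rank
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


# metric_1..metric_6, rounded as in the table; the last entry is a placeholder.
raw = [0.1387, 0.4573, 0.8861, 0.0320, 0.6910, 1e-6]
bh = benjamini_hochberg(raw)
print([round(p, 4) for p in bh])
```

Unlike the FWER adjustments, the multiplier shrinks with the rank of the p-value, which is why the fdr_bh column is the least inflated one in the comparison table.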

How many metrics stay significant at 5% under each approach?

[5]:
alpha = 0.05
for label, cm in [("none", None), ("bonferroni", "bonferroni"), ("holm", "holm"), ("fdr_bh", "fdr_bh")]:
    res = tester.run("absolute", method="theory", correction_method=cm, as_table=True)
    n_sig = int((res["pvalue"] < alpha).sum())
    print(f"{label:11s}: {n_sig} significant metric(s) at {alpha}")
none       : 2 significant metric(s) at 0.05
bonferroni : 1 significant metric(s) at 0.05
holm       : 1 significant metric(s) at 0.05
fdr_bh     : 1 significant metric(s) at 0.05

none reports 2, one of which is the false positive metric_4. Each of the three corrections removes that false positive and still keeps the genuine metric_6.

6. What happens to confidence intervals?

For the constant-scaling methods (bonferroni, sidak) the confidence intervals are widened to stay consistent with the corrected decision. The step-wise methods (Holm, FDR, …) adjust only the p-values and leave the intervals at their nominal level.

[6]:
none_ci = tester.run("absolute", method="theory", correction_method=None, as_table=True)
bonf_ci = tester.run("absolute", method="theory", correction_method="bonferroni", as_table=True)
holm_ci = tester.run("absolute", method="theory", correction_method="holm", as_table=True)

pd.DataFrame({
    "metric": none_ci["metric name"].values,
    "CI (none)": list(none_ci["confidence_interval"]),
    "CI (bonferroni - wider)": list(bonf_ci["confidence_interval"]),
    "CI (holm - same as none)": list(holm_ci["confidence_interval"]),
})
[6]:
metric CI (none) CI (bonferroni - wider) CI (holm - same as none)
0 metric_1 (-0.0155, 0.1111) (-0.0373, 0.133) (-0.0155, 0.1111)
1 metric_2 (-0.0838, 0.0377) (-0.1048, 0.0587) (-0.0838, 0.0377)
2 metric_3 (-0.0664, 0.0573) (-0.0878, 0.0787) (-0.0664, 0.0573)
3 metric_4 (-0.1285, -0.0058) (-0.1497, 0.0154) (-0.1285, -0.0058)
4 metric_5 (-0.0486, 0.0733) (-0.0697, 0.0944) (-0.0486, 0.0733)
5 metric_6 (0.1358, 0.2583) (0.1146, 0.2795) (0.1358, 0.2583)
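The amount of widening can be sanity-checked by hand. The sketch below assumes that the "theory" intervals are normal-approximation intervals and that the Bonferroni correction simply changes the level from 1 - alpha to 1 - alpha/m; this is an assumption about the internals, not something stated by the library. Under that assumption the interval half-width should grow by the ratio of the two critical values, which can be compared with the metric_1 row of the table.

```python
from statistics import NormalDist


def z_critical(alpha: float) -> float:
    """Two-sided normal critical value for significance level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


alpha, m = 0.05, 6  # six metrics tested at the 5% level
z_nominal = z_critical(alpha)
z_bonferroni = z_critical(alpha / m)
predicted_ratio = z_bonferroni / z_nominal

# Half-widths read off the metric_1 row of the table (rounded values).
half_none = (0.1111 - (-0.0155)) / 2
half_bonf = (0.133 - (-0.0373)) / 2
observed_ratio = half_bonf / half_none

print(f"predicted width ratio: {predicted_ratio:.3f}")
print(f"observed width ratio:  {observed_ratio:.3f}")
```

The two ratios agree to about two decimal places, which is consistent with the assumed alpha/m adjustment but does not prove it.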

7. Aliases and summary

Friendly aliases are accepted, e.g. "benjamini-hochberg" == "fdr_bh" and "benjamini-yekutieli" == "fdr_by".

[7]:
alias = tester.run("absolute", method="theory", correction_method="benjamini-hochberg", as_table=True)
canonical = tester.run("absolute", method="theory", correction_method="fdr_bh", as_table=True)
bool((alias["pvalue"].values == canonical["pvalue"].values).all())
[7]:
True

Rule of thumb

Situation                                          Suggested correction_method
One metric, two groups                             None (nothing to correct)
A few key metrics, must avoid any false positive   "holm" (or "bonferroni")
Many metrics on a dashboard                        "fdr_bh"
You also need corrected confidence intervals       "bonferroni" or "sidak"

Switching the correction is a one-argument change to Tester.run.
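The rule of thumb above can also be written down as a small helper. This is a hypothetical function, not part of Ambrosia; its name, parameters and the order in which the rules are checked are assumptions made for illustration.

```python
from typing import Optional


def suggest_correction(n_tests: int, need_corrected_ci: bool = False,
                       avoid_any_false_positive: bool = False) -> Optional[str]:
    """Hypothetical helper mirroring the rule-of-thumb table; not an Ambrosia function."""
    if n_tests <= 1:
        return None  # one metric, two groups: nothing to correct
    if need_corrected_ci:
        return "bonferroni"  # "sidak" is the other option from the table
    if avoid_any_false_positive:
        return "holm"
    return "fdr_bh"  # many metrics on a dashboard


print(suggest_correction(1))
print(suggest_correction(3, avoid_any_false_positive=True))
print(suggest_correction(40))
```

The returned value is meant to be passed as the correction_method argument of Tester.run.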