Effect Measurement

Tools for assessing the statistical significance of completed experiments and calculating the experimental uplift value with corresponding confidence intervals.

Multiple testing correction

When several hypotheses are tested (the number of group-pair combinations times the number of metrics passed), the groups are compared in pairs and the p-values are adjusted for multiplicity. The correction_method parameter selects the procedure: "bonferroni" (default), "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel" or "simes-hochberg"; pass None to disable the correction. The Benjamini-Hochberg and Benjamini-Yekutieli procedures control the false discovery rate; the others control the family-wise error rate. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the step-wise methods adjust only the p-values and leave the intervals at the nominal level. Hypotheses whose p-value cannot be computed still count toward the family size.
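The arithmetic behind a few of these procedures can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (family_size, bonferroni, holm, benjamini_hochberg, interval_alpha) are hypothetical, and the interval rule shown (building each interval at a per-comparison error level) is only an assumption about what "widened accordingly" means.

```python
from itertools import combinations


def family_size(group_labels, metrics):
    """Number of hypotheses: group pairs times metrics."""
    n_pairs = len(list(combinations(group_labels, 2)))
    return n_pairs * len(metrics)


def bonferroni(pvalues):
    """Multiply every p-value by the family size, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down Holm adjustment (family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up Benjamini-Hochberg adjustment (false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        running_min = min(running_min, pvalues[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted


def interval_alpha(alpha, m, method):
    """Per-comparison error level for the single-step methods (assumed rule)."""
    if method == "bonferroni":
        return alpha / m
    if method == "sidak":
        return 1.0 - (1.0 - alpha) ** (1.0 / m)
    return alpha  # step-wise methods: the interval stays at the nominal level


m = family_size(["A", "B", "C"], ["ltv", "retention"])  # 3 pairs * 2 metrics
raw = [0.01, 0.04, 0.03]
print(m, bonferroni(raw), holm(raw), benjamini_hochberg(raw))
```

With three groups and two metrics the family contains six hypotheses, so under this reading a Bonferroni-corrected interval at a nominal level of 0.05 would be built at 0.05 / 6 per comparison.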

Tester

Unit for evaluating the results of experiments.

test

Function wrapper around the Tester class.


class ambrosia.tester.Tester(dataframe=None, df_mapping=None, experiment_results=None, column_groups=None, group_labels=None, id_column=None, first_type_errors=0.05, metrics=None, metric_funcs=None)

Unit for evaluating the results of experiments.

The experiment evaluation result contains:
  • P-value for the selected criterion

  • Point estimate of the effect

  • Corresponding confidence interval for the effect

  • Boolean result: presence or absence of the effect

Parameters:
dataframe : PassedDataType, optional

Dataframe with the metric values of the experiment results.

df_mapping : GroupsInfoType, optional

Dataframe which contains the group labels of objects.

experiment_results : ExperimentResults, optional

Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.

column_groups : ColumnNameType, optional

Column which contains the group labels of objects.

group_labels : GroupLabelsType, optional

Labels for experimental groups. If the column_groups column contains at least two distinct values, they will be chosen as the labels.

id_column : ColumnNameType, optional

Name of the column with object ids in the df_mapping dataframe.

first_type_errors : StatErrorType, default: 0.05

Type I error values: upper bounds on the probability of detecting a difference between groups that are in fact equal. Also used to construct confidence intervals.

metrics : MetricNameType, optional

Metrics (columns of the dataframe) which are used to calculate the experiment result.

metric_funcs : Dict[str, Callable], optional

Dictionary mapping metric names to callable functions. Each function receives a pd.DataFrame (group data) and must return an array-like of numeric values. When provided, the function is used instead of column lookup for the corresponding metric name. Only supported for pandas DataFrames.
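The two less obvious inputs, experiment_results and metric_funcs, could be prepared roughly as follows. This is a sketch on made-up data: the column names (groups, revenue, orders) and the derived metric name revenue_per_order are hypothetical, and the Tester call itself is left out so that the snippet runs with pandas alone.

```python
import pandas as pd

# Toy experiment data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "groups": ["A", "A", "A", "B", "B", "B"],
        "revenue": [10.0, 12.0, 9.0, 14.0, 11.0, 15.0],
        "orders": [2, 3, 3, 2, 1, 3],
    }
)

# experiment_results: one dataframe per group, keyed by the group label.
experiment_results = {label: part for label, part in df.groupby("groups")}


def revenue_per_order(group_df: pd.DataFrame) -> pd.Series:
    """Derived metric: takes a group's dataframe, returns one value per row."""
    return group_df["revenue"] / group_df["orders"]


# metric_funcs: metric name -> callable applied to each group's dataframe.
metric_funcs = {"revenue_per_order": revenue_per_order}

for label, group_df in experiment_results.items():
    values = metric_funcs["revenue_per_order"](group_df)
    print(label, values.round(2).tolist())
```

Given the signature above, these objects would presumably be passed as experiment_results=experiment_results, metrics='revenue_per_order' and metric_funcs=metric_funcs.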

Attributes:
dataframe : PassedDataType

Dataframe with the metric values of the experiment results.

df_mapping : GroupsInfoType

Dataframe which contains the group labels of objects.

experiment_results : ExperimentResults, optional

Dict with separate experiment results for each group.

column_groups : ColumnNameType

Column which contains the group labels of objects.

group_labels : GroupLabelsType

Labels for experimental groups.

id_column : ColumnNameType

Name of the column with object ids in the df_mapping dataframe.

first_type_errors : StatErrorType, default: 0.05

Type I error values.

metrics : MetricNameType

Columns of the dataframe with experiment results.

Notes

Basic mathematical methods for evaluating experiments:

  • Theory:
    • Absolute: Using the t-test, Mann-Whitney and other criteria, including custom ones

    • Relative: Using delta method

  • Empiric:
    • Absolute / Relative: Building empirical distribution for T(A, B)

  • Binary:
    • Absolute: Using special binary intervals and finding pvalue = inf {a : 0 not in interval(a)}, the smallest level a at which the interval excludes zero

    • Relative: Not implemented yet :(
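The three approaches listed above can be illustrated with textbook formulas in plain Python. This is a hedged sketch rather than the library's code: all function names are hypothetical, the delta method assumes independent samples and a normal approximation, and a Wald interval for the difference of proportions stands in for the "special binary intervals", whose exact form is not given here.

```python
import math
import random
from statistics import NormalDist, mean, variance

_NORMAL = NormalDist()


def relative_effect_delta(a, b, alpha=0.05):
    """Relative uplift mean(b) / mean(a) - 1 with a delta-method interval."""
    mean_a, mean_b = mean(a), mean(b)
    var_mean_a = variance(a) / len(a)
    var_mean_b = variance(b) / len(b)
    effect = mean_b / mean_a - 1.0
    # First-order Taylor expansion of the ratio mean_b / mean_a.
    var_ratio = var_mean_b / mean_a**2 + mean_b**2 * var_mean_a / mean_a**4
    half_width = _NORMAL.inv_cdf(1.0 - alpha / 2.0) * math.sqrt(var_ratio)
    return effect, (effect - half_width, effect + half_width)


def bootstrap_interval(a, b, statistic, alpha=0.05, n_boot=2000, seed=0):
    """Percentile interval from the empirical distribution of statistic(A, B)."""
    rng = random.Random(seed)
    draws = sorted(
        statistic(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = draws[int(n_boot * alpha / 2.0)]
    upper = draws[int(n_boot * (1.0 - alpha / 2.0)) - 1]
    return lower, upper


def wald_interval(successes_a, n_a, successes_b, n_b):
    """Difference of proportions, its standard error and interval(level)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    def interval(level):
        z = _NORMAL.inv_cdf(1.0 - level / 2.0)
        return diff - z * se, diff + z * se

    return diff, se, interval


def pvalue_by_inversion(interval, tol=1e-10):
    """Smallest level at which interval(level) excludes zero, by bisection.

    Relies on the interval shrinking as the level grows.
    """
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2.0
        left, right = interval(mid)
        if left <= 0.0 <= right:
            low = mid  # zero still inside: the p-value is larger than mid
        else:
            high = mid  # zero excluded: the p-value is at most mid
    return high


a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 6]
effect, ci = relative_effect_delta(a, b)
boot_ci = bootstrap_interval(a, b, lambda x, y: mean(y) - mean(x))
diff, se, interval = wald_interval(100, 1000, 130, 1000)
p_value = pvalue_by_inversion(interval)
print(effect, ci, boot_ci, p_value)
```

For the Wald interval the inverted p-value coincides with the usual two-sided normal p-value, which makes the bisection easy to check; for other interval constructions no such closed form may exist, which is when inversion becomes useful.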

Constructors:

>>> # Empty constructor
>>> tester = Tester()
>>> # You can pass Iterable or single object for some parameters
>>> tester = Tester(
>>>     dataframe=df,
>>>     column_groups='groups',
>>>     metrics=['ltv', 'retention']
>>> )
>>> tester = Tester(metrics='retention', first_type_errors=[0.01, 0.05])
>>> # You can set a separate table containing information about
>>> # the partitioning in the experiment
>>> tester = Tester(
>>>     dataframe=df, # main dataframe with metrics
>>>     df_mapping=groups, # table with information about groups
>>>     metrics='metric', # Metric to be tested
>>>     column_groups='group', # Column in df_mapping with labels
>>>     id_column='id' # Column with ids in df and df_mapping (for join)
>>> )

Setters:

>>> tester.set_metrics(['ltv', 'retention'])
>>> tester.set_dataframe(dataframe=dataframe, column_groups='groups')
>>> # You can set separate data of each group packed in special dict form
>>> tester.set_experiment_results(experiment_results=experiment_results)

Run:

>>> # You can choose effect_type to estimate: relative / absolute
>>> tester.run('absolute')
>>> # Also you can choose method
>>> tester.run('absolute', method='empiric') # empiric for bootstrap
>>> # One can pass arguments in run() method and they will have
>>> # higher priority
>>> tester.run(metrics='ltv', data_a_group=df_a)

Use a function instead of a class:

>>> test('absolute', dataframe=df, column_groups='groups', metrics='ltv')

Examples

Suppose we have added onboarding to our mobile app and want to evaluate the change as an A/B test. We have a loaded pandas dataframe with a column that holds the group labels and columns with metric values, such as retention. The Tester class can then be used as follows:

>>> tester = Tester(
>>>     dataframe=df,
>>>     column_groups='groups',
>>>     metrics='retention'
>>> )
>>> tester.run(as_table=False)
>>> # Output
>>> [{
>>>     'first_type_error' : 0.05,
>>>     'pvalue' : 0.03,
>>>     'effect' : 1.05,
>>>     'confidence_interval' : (1.01, 1.10),
>>>     'metric name': 'retention',
>>>     'group A label': 'A',
>>>     'group B label': 'B'
>>> }]
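A result in the list-of-dicts form shown above can be post-processed with ordinary Python. The sketch below copies the example output and derives the presence or absence of the effect by comparing the p-value with the type I error level; the helpers has_effect and describe are hypothetical and not part of the library.

```python
# Example output copied from above (values are illustrative).
results = [
    {
        "first_type_error": 0.05,
        "pvalue": 0.03,
        "effect": 1.05,
        "confidence_interval": (1.01, 1.10),
        "metric name": "retention",
        "group A label": "A",
        "group B label": "B",
    }
]


def has_effect(row):
    """True when the p-value falls below the row's type I error level."""
    return row["pvalue"] < row["first_type_error"]


def describe(row):
    """One-line summary of a single comparison."""
    low, high = row["confidence_interval"]
    verdict = "effect detected" if has_effect(row) else "no effect detected"
    return (
        f"{row['metric name']}: {row['group A label']} vs {row['group B label']}, "
        f"effect={row['effect']}, CI=({low}, {high}), {verdict}"
    )


for row in results:
    print(describe(row))
```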

run(effect_type='absolute', method='theory', dataframe=None, df_mapping=None, experiment_results=None, id_column=None, column_groups=None, group_labels=None, metrics=None, first_type_errors=None, criterion=None, correction_method='bonferroni', as_table=True, metric_funcs=None, **kwargs)

The main method for testing and evaluating experimental results.

Parameters:
effect_type : str, default: "absolute"

Effect type to calculate. Can be "absolute" or "relative".

method : str, default: "theory"

Type of testing approach. Can take the values "theory", "empiric" or "binary".

dataframe : PassedDataType, optional

Data used to calculate the results of an experiment.

df_mapping : GroupsInfoType, optional

Dataframe which contains the group labels of objects.

experiment_results : ExperimentResults

Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.

column_groups : ColumnNameType

Column which contains the group labels of objects.

group_labels : GroupLabelsType

Labels for experimental groups.

id_column : ColumnNameType

Name of the column with object ids in the df_mapping dataframe.

first_type_errors : StatErrorType, default: 0.05

Type I error values.

metrics : MetricNameType

Columns of the dataframe with experiment results.

criterion : ABStatCriterion, optional

Statistical criterion for hypothesis testing. If method is "theory" and no criterion is provided, a t-test for independent samples is used.

correction_method : Union[str, None], default: "bonferroni"

Method for multiple hypothesis testing correction of p-values. Supported values: "bonferroni", "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel", "simes-hochberg"; pass None to disable the correction. The family size equals the number of group-pair combinations times the number of metrics. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the other, step-wise methods adjust only the p-values and leave the intervals at the nominal level.

as_table : bool, default: True

Return the test results as a pandas dataframe. If False, a list of dicts with results is returned.

metric_funcs : Dict[str, Callable], optional

Dictionary mapping metric names to callable functions. Each function receives a group pd.DataFrame and returns array-like values. Overrides functions set in the constructor for matching metric names. Only pandas DataFrames are supported.

**kwargs : Dict

Other keyword arguments.

Returns:
result : types.TesterResult

Experiment results as a pandas table or a list of dicts, with one entry for each metric and type I error value.

ambrosia.tester.test(effect_type='absolute', method='theory', dataframe=None, df_mapping=None, experiment_results=None, id_column=None, column_groups=None, group_labels=None, metrics=None, first_type_errors=None, criterion=None, correction_method='bonferroni', as_table=True, metric_funcs=None, **kwargs)

Function wrapper around the Tester class.

Apply it to experimental data to get the results of an experiment.

Creates an instance of the Tester class internally and executes its run method with the corresponding arguments.
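The wrapper pattern described here can be sketched with a stand-in class. Everything below is hypothetical (StubTester and stub_test are not library objects, and the real classes do far more); the sketch only shows a function that builds a fresh instance and delegates to run, with arguments passed to run taking priority over stored settings.

```python
class StubTester:
    """Hypothetical stand-in for a tester class; not the library's code."""

    def __init__(self, metrics=None, first_type_errors=0.05):
        self.metrics = metrics
        self.first_type_errors = first_type_errors

    def run(self, effect_type="absolute", metrics=None, first_type_errors=None):
        # Arguments given to run() take priority over the stored settings;
        # None means "fall back to what the constructor received".
        metrics = self.metrics if metrics is None else metrics
        alphas = (
            self.first_type_errors
            if first_type_errors is None
            else first_type_errors
        )
        return {
            "effect_type": effect_type,
            "metrics": metrics,
            "first_type_errors": alphas,
        }


def stub_test(effect_type="absolute", **kwargs):
    """Function wrapper: build a fresh instance, then delegate to run()."""
    return StubTester().run(effect_type, **kwargs)


print(stub_test("relative", metrics="ltv"))
print(StubTester(metrics="retention").run(metrics="ltv"))
```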

Parameters:
effect_type : str, default: "absolute"

Effect type to calculate. Can be "absolute" or "relative".

method : str, default: "theory"

Type of testing approach. Can take the values "theory", "empiric" or "binary".

dataframe : PassedDataType, optional

Data used to calculate the results of an experiment.

df_mapping : GroupsInfoType, optional

Dataframe which contains the group labels of objects.

experiment_results : ExperimentResults

Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.

column_groups : ColumnNameType

Column which contains the group labels of objects.

group_labels : GroupLabelsType

Labels for experimental groups.

id_column : ColumnNameType

Name of the column with object ids in the df_mapping dataframe.

first_type_errors : StatErrorType, default: 0.05

Type I error values.

metrics : MetricNameType

Columns of the dataframe with experiment results.

criterion : ABStatCriterion, optional

Statistical criterion for hypothesis testing. If method is "theory" and no criterion is provided, a t-test for independent samples is used.

correction_method : Union[str, None], default: "bonferroni"

Method for multiple hypothesis testing correction of p-values. Supported values: "bonferroni", "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel", "simes-hochberg"; pass None to disable the correction. The family size equals the number of group-pair combinations times the number of metrics. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the other, step-wise methods adjust only the p-values and leave the intervals at the nominal level.

as_table : bool, default: True

Return the test results as a pandas dataframe. If False, a list of dicts with results is returned.

metric_funcs : Dict[str, Callable], optional

Dictionary mapping metric names to callable functions. Each function receives a group pd.DataFrame and returns array-like values. Only pandas DataFrames are supported.

**kwargs : Dict

Other keyword arguments.

Returns:
result : types.TesterResult

Experiment results as a pandas table or a list of dicts, with one entry for each metric and type I error value.

Examples of using testing tools