Effect Measurement

Tools for assessing the statistical significance of completed experiments and calculating the experimental uplift value with corresponding confidence intervals.

Multiple testing correction

When several hypotheses (number of variant combinations * number of metrics passed) are tested, the groups are compared in pairs and the p-values are adjusted for multiplicity. The correction_method parameter selects the procedure: "bonferroni" (default), "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel" or "simes-hochberg" (pass None to disable). The Benjamini-Hochberg and Benjamini-Yekutieli procedures control the false discovery rate; the others control the family-wise error rate. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the step-wise methods adjust only the p-values and leave the intervals at the nominal level. Hypotheses whose p-value cannot be computed still count toward the family size.
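To make the difference between the single-step and step-wise adjustments concrete, here is a minimal pure-Python sketch of the Bonferroni, Sidak and Holm adjustments. It is not Ambrosia's implementation: the helper names (`family_size`, `bonferroni`, `sidak`, `holm`) are hypothetical and the raw p-values are invented for illustration.

```python
from math import comb


def family_size(n_groups: int, n_metrics: int) -> int:
    """Number of hypotheses: pairwise group comparisons times metrics."""
    return comb(n_groups, 2) * n_metrics


def bonferroni(pvalues):
    """Multiply each p-value by the family size, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def sidak(pvalues):
    """Sidak adjustment: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values: 3 groups (3 pairs) x 1 metric = 3 hypotheses.
raw = [0.01, 0.04, 0.03]
m = family_size(n_groups=3, n_metrics=1)
print("family size:", m)
print("bonferroni:", bonferroni(raw))
print("holm:", holm(raw))
```

Holm never produces a larger adjusted p-value than Bonferroni on the same input, which is why the step-wise methods are usually the less conservative choice when only p-values are needed.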

`Tester`	Unit for evaluating the results of experiments.
`test`	Function wrapper around the `Tester` class.

class ambrosia.tester.Tester(dataframe=None, df_mapping=None, experiment_results=None, column_groups=None, group_labels=None, id_column=None, first_type_errors=0.05, metrics=None, metric_funcs=None)

Unit for evaluating the results of experiments.

The experiment evaluation result contains:

Pvalue for the selected criterion
Point effect estimation
Corresponding confidence interval for the effect
Boolean result - presence / absence of the effect

Parameters:

dataframe (PassedDataType, optional): Dataframe with the metric values of the experiment.
df_mapping (GroupsInfoType, optional): Dataframe which contains the group labels of objects.
experiment_results (ExperimentResults, optional): Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.
column_groups (ColumnNameType, optional): Column which contains the group labels of objects.
group_labels (GroupLabelsType, optional): Labels for experimental groups. If column_groups contains at least two values, they will be used as the labels.
id_column (ColumnNameType, optional): Name of the column with object ids in the df_mapping dataframe.
first_type_errors (StatErrorType, default: 0.05): Type I error values. Each value bounds P(detecting a difference when the groups are equal) and is also used to construct the confidence intervals.
metrics (MetricNameType, optional): Metrics (columns of the dataframe) which are used to calculate the experiment result.
metric_funcs (Dict[str, Callable], optional): Dictionary mapping metric names to callable functions. Each function receives a pd.DataFrame (group data) and must return an array-like of numeric values. When provided, the function is used instead of a column lookup for the corresponding metric name. Only supported for pandas DataFrames.

Attributes:

dataframe (PassedDataType): Dataframe with the metric values of the experiment.
df_mapping (GroupsInfoType): Dataframe which contains the group labels of objects.
experiment_results (ExperimentResults, optional): Dict with separate experiment results for each group.
column_groups (ColumnNameType): Column which contains the group labels of objects.
group_labels (GroupLabelsType): Labels for experimental groups.
id_column (ColumnNameType): Name of the column with object ids in the df_mapping dataframe.
first_type_errors (StatErrorType, default: 0.05): Type I error values.
metrics (MetricNameType): Columns of the dataframe with experiment results.

Notes

Basic mathematical methods for evaluating experiments:

Theory:

Absolute: Using ttest, mann-whitney, others and custom criteria

Relative: Using delta method

Empiric:

Absolute / Relative: Building empirical distribution for T(A, B)

Binary:

Absolute: Using special binary intervals and finding pvalue = inf {a : 0 not in interval(a)}, where interval(a) is the confidence interval built at significance level a

Relative: Not implemented yet :(
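The definition pvalue = inf {a : 0 not in interval(a)} for the binary method can be illustrated with a short sketch that searches for the smallest significance level whose interval excludes zero. This is an assumption-laden illustration, not Ambrosia's code: a plain Wald interval for the difference of two proportions stands in for the "special binary intervals", and the names `wald_interval` and `pvalue_by_inversion` are hypothetical.

```python
from statistics import NormalDist


def wald_interval(successes_a, n_a, successes_b, n_b, alpha):
    """Wald interval for the difference of two proportions (B minus A)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def pvalue_by_inversion(successes_a, n_a, successes_b, n_b, tol=1e-9):
    """Smallest alpha whose (1 - alpha) interval excludes zero, by bisection."""
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        left, right = wald_interval(successes_a, n_a, successes_b, n_b, mid)
        if left <= 0.0 <= right:
            low = mid   # zero is still covered: alpha is too small
        else:
            high = mid  # zero is excluded: try a smaller alpha
    return high


# Hypothetical counts: 120/1000 conversions in A, 150/1000 in B.
print(pvalue_by_inversion(120, 1000, 150, 1000))
```

The bisection works because the interval shrinks monotonically as the significance level grows, so there is a single threshold below which zero is covered and above which it is not. For the Wald interval this threshold coincides with the usual two-sided z-test p-value; other interval constructions generally have no closed form, which is where the inversion approach becomes useful.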

Constructors:

>>> # Empty constructor
>>> tester = Tester()
>>> # You can pass an Iterable or a single object for some parameters
>>> tester = Tester(
...     dataframe=df,
...     column_groups='groups',
...     metrics=['ltv', 'retention']
... )
>>> tester = Tester(metrics='retention', first_type_errors=[0.01, 0.05])
>>> # You can set a separate table containing information about
>>> # the partitioning in the experiment
>>> tester = Tester(
...     dataframe=df,  # main dataframe with metrics
...     df_mapping=groups,  # table with information about groups
...     metrics='metric',  # metric to be tested
...     column_groups='group',  # column in df_mapping with labels
...     id_column='id'  # column with ids in df and df_mapping (for join)
... )

Setters:

>>> tester.set_metrics(['ltv', 'retention'])
>>> tester.set_dataframe(dataframe=dataframe, column_groups='groups')
>>> # You can set separate data of each group packed in special dict form
>>> tester.set_experiment_results(experiment_results=experiment_results)

Run:

>>> # You can choose effect_type to estimate: relative / absolute
>>> tester.run('absolute')
>>> # Also you can choose method
>>> tester.run('absolute', method='empiric')  # empiric for bootstrap
>>> # One can pass arguments in run() method and they will have
>>> # higher priority
>>> tester.run(metrics='ltv', data_a_group=df_a)

Use a function instead of a class:

>>> test('absolute', dataframe=df, column_groups='groups', metrics='ltv')

Examples

We ran an experiment that added onboarding to our mobile app and want to evaluate its results as an A/B test. Suppose we have loaded a pandas dataframe with a column holding the group labels and columns with metric values, such as retention. Then the Tester class can be used in the following way:

>>> tester = Tester(
...     dataframe=df,
...     column_groups='groups',
...     metrics='retention'
... )
>>> # as_table=False returns a list of dicts instead of a pandas dataframe
>>> tester.run(as_table=False)
[{
    'first_type_error': 0.05,
    'pvalue': 0.03,
    'effect': 1.05,
    'confidence_interval': (1.01, 1.10),
    'metric name': 'retention',
    'group A label': 'A',
    'group B label': 'B'
}]
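A result in the list-of-dicts form can be post-processed with ordinary Python. The sketch below applies the standard decision rule (p-value below the type I error threshold) to the example output above; the helper name `has_effect` is hypothetical, and whether Ambrosia derives its boolean result with exactly this rule is an assumption.

```python
def has_effect(result: dict) -> bool:
    """True when the p-value is below the type I error threshold."""
    return result['pvalue'] < result['first_type_error']


# The example output from the documentation, copied as a literal.
results = [{
    'first_type_error': 0.05,
    'pvalue': 0.03,
    'effect': 1.05,
    'confidence_interval': (1.01, 1.10),
    'metric name': 'retention',
    'group A label': 'A',
    'group B label': 'B',
}]

for res in results:
    verdict = 'effect detected' if has_effect(res) else 'no effect detected'
    pair = f"{res['group A label']} vs {res['group B label']}"
    print(f"{res['metric name']}: {pair} -> {verdict}")
```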

run(effect_type='absolute', method='theory', dataframe=None, df_mapping=None, experiment_results=None, id_column=None, column_groups=None, group_labels=None, metrics=None, first_type_errors=None, criterion=None, correction_method='bonferroni', as_table=True, metric_funcs=None, **kwargs)

The main method for testing and evaluating experimental results.

Parameters:

effect_type (str, default: "absolute"): Effect type to calculate. Can be "absolute" or "relative".
method (str, default: "theory"): Type of testing approach. Can take the values "theory", "empiric" or "binary".
dataframe (PassedDataType, optional): Data used to calculate the results of an experiment.
df_mapping (GroupsInfoType, optional): Dataframe which contains the group labels of objects.
experiment_results (ExperimentResults): Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.
column_groups (ColumnNameType): Column which contains the group labels of objects.
group_labels (GroupLabelsType): Labels for experimental groups.
id_column (ColumnNameType): Name of the column with object ids in the df_mapping dataframe.
first_type_errors (StatErrorType, default: 0.05): Type I error values.
metrics (MetricNameType): Columns of the dataframe with experiment results.
criterion (ABStatCriterion, optional): Statistical criterion for hypothesis testing. If method is "theory" and no criterion is provided, the t-test for independent samples is used.
correction_method (Union[str, None], default: "bonferroni"): Method for multiple hypothesis testing correction of p-values. Supported values: "bonferroni", "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel", "simes-hochberg"; pass None to disable correction. The family size equals the number of group-pair combinations times the number of metrics. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the other, step-wise methods adjust only the p-values and leave the intervals at the nominal level.
as_table (bool, default: True): Return the test results as a pandas dataframe. If False, a list of dicts with results is returned.
metric_funcs (Dict[str, Callable], optional): Dictionary mapping metric names to callable functions. Each function receives a group pd.DataFrame and returns array-like values. Overrides functions set in the constructor for matching metric names. Only pandas DataFrames are supported.
**kwargs (Dict): Other keyword arguments.

Returns:

result (types.TesterResult): Experiment results as a pandas table or a list of dicts for each metric and type I error value.

ambrosia.tester.test(effect_type='absolute', method='theory', dataframe=None, df_mapping=None, experiment_results=None, id_column=None, column_groups=None, group_labels=None, metrics=None, first_type_errors=None, criterion=None, correction_method='bonferroni', as_table=True, metric_funcs=None, **kwargs)

Function wrapper around the Tester class.

Apply on the experimental data to get the results of an experiment.

Creates an instance of the Tester class internally and executes its run method with the corresponding arguments.

Parameters:

effect_type (str, default: "absolute"): Effect type to calculate. Can be "absolute" or "relative".
method (str, default: "theory"): Type of testing approach. Can take the values "theory", "empiric" or "binary".
dataframe (PassedDataType, optional): Data used to calculate the results of an experiment.
df_mapping (GroupsInfoType, optional): Dataframe which contains the group labels of objects.
experiment_results (ExperimentResults): Dict with separate experiment results for each group. Dict keys are used as group labels; values must be either pandas or Spark dataframes.
column_groups (ColumnNameType): Column which contains the group labels of objects.
group_labels (GroupLabelsType): Labels for experimental groups.
id_column (ColumnNameType): Name of the column with object ids in the df_mapping dataframe.
first_type_errors (StatErrorType, default: 0.05): Type I error values.
metrics (MetricNameType): Columns of the dataframe with experiment results.
criterion (ABStatCriterion, optional): Statistical criterion for hypothesis testing. If method is "theory" and no criterion is provided, the t-test for independent samples is used.
correction_method (Union[str, None], default: "bonferroni"): Method for multiple hypothesis testing correction of p-values. Supported values: "bonferroni", "sidak", "holm", "holm-sidak", "fdr_bh" (Benjamini-Hochberg), "fdr_by" (Benjamini-Yekutieli), "hommel", "simes-hochberg"; pass None to disable correction. The family size equals the number of group-pair combinations times the number of metrics. For "bonferroni" and "sidak" the confidence intervals are widened accordingly; the other, step-wise methods adjust only the p-values and leave the intervals at the nominal level.
as_table (bool, default: True): Return the test results as a pandas dataframe. If False, a list of dicts with results is returned.
metric_funcs (Dict[str, Callable], optional): Dictionary mapping metric names to callable functions. Each function receives a group pd.DataFrame and returns array-like values. Only pandas DataFrames are supported.
**kwargs (Dict): Other keyword arguments.

Returns:

result (types.TesterResult): Experiment results as a pandas table or a list of dicts for each metric and type I error value.
