Groups Splitting

The following classes and functions help split batch data into experimental groups using different approaches.

Real-time Splitter availability

The real-time splitting tools are under development. This functionality is intended to be applied to batch data only.

`Splitter`	Unit for creating experimental groups from batch data.
`load_from_config`	Restore a `Splitter` class instance from a yaml config.
`split`	Function wrapper around the `Splitter` class.

class ambrosia.splitter.Splitter(dataframe=None, id_column=None, groups_size=None, test_group_ids=None, fit_columns=None, strat_columns=None)

Unit for creating experimental groups from batch data.

Split your data into groups of selected size with respect to:

Stratification columns
Metric distance of objects in feature space
Set of passed ids

Parameters:

dataframe (PassedDataType, optional): Dataframe, or the name of a .csv table, containing the data used for the group split.
id_column (IdColumnNameType, optional): Name of the id column used in the hash split.
groups_size (int, optional): Size of each split group.
test_group_ids (PeriodColumnNamesType, optional): Ids of the objects in the B (test) group. Used when the A (control) group is picked after the experiment.
fit_columns (PeriodColumnNamesType, optional): List of column names whose values are interpreted as coordinates of points in a multidimensional space during the metric split.
strat_columns (PeriodColumnNamesType, optional): Columns for stratification. See https://en.wikipedia.org/wiki/Stratified_sampling

Attributes:

dataframe (PassedDataType): Pandas or Spark dataframe with the data to split.
id_column (IdColumnNameType): Name of the id column used in the hash split.
groups_size (int): Size of each split group.
test_group_ids (PeriodColumnNamesType): Ids of the objects in the B (test) group.
fit_columns (PeriodColumnNamesType): List of column names used for the metric split.
strat_columns (PeriodColumnNamesType): Names of the stratification columns.

Notes

Main methods for split:

Simple:

Randomly chosen groups (via np.random.choice).

Hash:

Hashes the identifiers, distributes them into buckets, and selects the desired buckets to form the groups.

Metric:

For a fixed or randomly selected reference group, the other groups are selected using the nearest-neighbor method on the columns passed in the fit_columns parameter.
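
To make the hash method from the notes above more concrete, here is a minimal standard-library sketch of salted hashing into buckets. It is not Ambrosia's implementation: the names `hash_bucket` and `hash_split`, the choice of SHA-256, and the default of 100 buckets are assumptions made for illustration, and the sketch assigns every id to a group instead of picking groups of a fixed size.

```python
import hashlib


def hash_bucket(object_id, salt, n_buckets=100):
    """Map an id to a bucket in [0, n_buckets) using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}{object_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_split(ids, salt, groups_number=2, n_buckets=100):
    """Assign ids to groups "A", "B", ... by slicing the bucket range evenly."""
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    buckets_per_group = n_buckets // groups_number
    assignment = {}
    for object_id in ids:
        bucket = hash_bucket(object_id, salt, n_buckets)
        # Leftover buckets (when n_buckets is not divisible) go to the last group.
        group_index = min(bucket // buckets_per_group, groups_number - 1)
        assignment[object_id] = labels[group_index]
    return assignment


user_ids = [f"user_{i}" for i in range(1000)]
groups = hash_split(user_ids, salt="onboarding")
```

Because the assignment depends only on the id and the salt, repeating the call reproduces the same groups, while changing the salt reshuffles users for an unrelated experiment.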

Constructors:

>>> # Empty constructor
>>> splitter = Splitter()
>>> # Some data
>>> splitter = Splitter(dataframe=df,
...                     id_column='my_id_column',
...                     strat_columns=['gender', 'age'],
...                     test_group_ids=ids_for_B_group
... )

Setters:

>>> splitter.set_dataframe(dataframe)
>>> # You can pass string for pd.read_csv
>>> splitter.set_dataframe('name_of_table.csv')
>>> # Other setters
>>> splitter.set_group_size(1000)
>>> splitter.set_strat_columns(['age', 'region'])

Run:

>>> splitter.run(method='hash', groups_size=10000)
>>> splitter.run(method='metric',
...              test_group_ids=b_group,
...              id_column='id',
...              strat_columns=['age', 'city'],
...              fit_columns=['metric_history_column', 'other_metric'],
...              method_metric='fast',  # Passed through **kwargs
...              norm='l2'  # Passed through **kwargs
... )
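
The metric call above picks groups by nearest neighbors in the space spanned by the fit columns. The following standard-library sketch shows the idea for a fixed test group and the L2 norm. The function names and the greedy one-to-one matching are assumptions for illustration only; the library's actual algorithm and its `fast` mode may work differently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_control_group(points, test_ids):
    """For each test id, greedily pick the closest still-unused candidate id."""
    candidates = {i: p for i, p in points.items() if i not in test_ids}
    control = []
    for test_id in test_ids:
        nearest = min(
            candidates,
            key=lambda i: l2_distance(points[test_id], candidates[i]),
        )
        control.append(nearest)
        del candidates[nearest]  # each control object is used at most once
    return control


# Coordinates play the role of the values in the fit columns.
points = {
    "u1": (1.0, 1.0),
    "u2": (1.1, 0.9),
    "u3": (5.0, 5.0),
    "u4": (5.2, 4.9),
    "u5": (9.0, 0.0),
}
control_ids = pick_control_group(points, test_ids=["u1", "u3"])
```

Here `u1` is matched with `u2` and `u3` with `u4`, so the control group resembles the test group in the chosen feature space.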

Load from yaml config:

>>> config = '''
...     !splitter # <--- this is yaml tag (important!)
...         groups_size:
...             1000
...         id_column:
...             id
...         strat_columns:
...             - age
...             - country
... '''
>>> splitter = yaml.load(config, Loader=yaml.Loader)
>>> # Or use the implemented function
>>> splitter = load_from_config(config)

Examples

Our development team decided to add onboarding to the mobile app. Already knowing the required group size, we would like to select users for groups A and B respectively. Using the splitter class, this task could be done in the following way:

>>> splitter = Splitter(dataframe=dataframe)
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')

Suppose now that people of different ages and from several countries use our application, and we would like to take this into account during the split. To do this, we can use stratification, which requires passing only one additional parameter:

>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')
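
What stratification buys can be sketched without the library: rows are grouped by the values of the stratification columns, and each stratum is divided evenly between the groups. The function `stratified_split` below is hypothetical, and it assigns rows at random within each stratum instead of hashing them, so treat it as an illustration of the principle, not of Ambrosia's internals.

```python
import random
from collections import defaultdict


def stratified_split(rows, strat_columns, seed=0):
    """Split rows into groups A and B so that every stratum is divided evenly."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in strat_columns)
        strata[key].append(row)

    groups = {"A": [], "B": []}
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        groups["A"].extend(members[:half])
        groups["B"].extend(members[half:2 * half])  # drop the odd row, if any
    return groups


rows = [
    {"id": i, "age": "18-30" if i % 2 else "31+", "country": "DE" if i % 3 else "FR"}
    for i in range(120)
]
groups = stratified_split(rows, strat_columns=["age", "country"])
```

Both groups end up with the same number of rows for every (age, country) combination, so neither attribute can skew the comparison between A and B.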

If we have fixed users for the testing group, this can be specified as a parameter:

>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(method='hash',
...              salt='onboarding',
...              test_group_ids=B_group_id
... )

run(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, **kwargs)

Perform a split into groups with selected or saved parameters.

Parameters:

method (str): Split method, for example "hash".
dataframe (PassedDataType, optional): Dataframe, or the name of a .csv table, containing the data used for the group split.
id_column (IdColumnNameType, optional): Name of the id column used in the hash split.
groups_size (int, optional): Size of each split group.
part_of_table (float, optional): Split factor (for group A) used when the whole dataframe is split. If not None, it overrides the groups_size parameter during the split.
groups_number (int, default: 2): Number of groups to split into.
test_group_ids (PeriodColumnNamesType, optional): Ids of the objects in the B (test) group. Used when the A (control) group is picked after the experiment.
strat_columns (PeriodColumnNamesType, optional): Columns for stratification. See https://en.wikipedia.org/wiki/Stratified_sampling
salt (str, optional): Salt for hashing in the hash split.
fit_columns (PeriodColumnNamesType, optional): List of column names whose values are interpreted as coordinates of points in a multidimensional space during the metric split.
**kwargs (Dict): Other keyword arguments.

Returns:

groups (pd.DataFrame): Dataframe with the groups and a label column. It contains all columns of the original dataframe.

Other Parameters:

threads (int, default: 1): Number of threads used for calculations.

ambrosia.splitter.load_from_config(yaml_config, loader=<class 'yaml.loader.Loader'>)

Restore a Splitter class instance from a yaml config.

For the yaml_config parameter, you can pass the name of a config file, which must end with .yaml, for example: “config.yaml”. For loader, you can choose SafeLoader.

ambrosia.splitter.split(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, threads=1, **kwargs)

Function wrapper around the Splitter class.

Used to create split groups from the dataframe.

Creates an instance of the Splitter class internally and executes its run method with the corresponding arguments.
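
The wrapper pattern described here can be sketched with a toy class. `ToySplitter` and `toy_split` are hypothetical stand-ins, not part of Ambrosia; they only show how a function can build a temporary instance and delegate to `run`, and how arguments passed to `run` might take priority over the ones saved on the instance.

```python
class ToySplitter:
    """Hypothetical stand-in for a class that stores defaults and exposes run()."""

    def __init__(self, items=None, groups_size=None):
        self.items = items
        self.groups_size = groups_size

    def run(self, method, items=None, groups_size=None):
        # Arguments passed to run() take priority over the saved ones.
        items = self.items if items is None else items
        groups_size = self.groups_size if groups_size is None else groups_size
        if method != "simple":
            raise ValueError(f"unsupported method: {method}")
        return {"A": items[:groups_size], "B": items[groups_size:2 * groups_size]}


def toy_split(method, items=None, groups_size=None):
    """Function wrapper: build a throwaway instance and delegate to run()."""
    return ToySplitter().run(method, items=items, groups_size=groups_size)


result = toy_split("simple", items=list(range(10)), groups_size=3)
```

A wrapper like this suits one-off splits, while keeping an instance around is more convenient when the same data and settings are reused across several runs.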

Parameters:

method (str): Split method, for example "hash".
dataframe (PassedDataType, optional): Dataframe, or the name of a .csv table, containing the data used for the group split.
id_column (IdColumnNameType, optional): Name of the id column used in the hash split.
groups_size (int, optional): Size of each split group.
part_of_table (float, optional): Split factor (for group A) used when the whole dataframe is split. If not None, it overrides the groups_size parameter during the split.
groups_number (int, default: 2): Number of groups to split into.
test_group_ids (PeriodColumnNamesType, optional): Ids of the objects in the B (test) group. Used when the A (control) group is picked after the experiment.
strat_columns (PeriodColumnNamesType, optional): Columns for stratification. See https://en.wikipedia.org/wiki/Stratified_sampling
salt (str, optional): Salt for hashing in the hash split.
fit_columns (PeriodColumnNamesType, optional): List of column names whose values are interpreted as coordinates of points in a multidimensional space during the metric split.
threads (int, default: 1): Number of threads used for calculations.
**kwargs (Dict): Other keyword arguments.

Returns:

groups (pd.DataFrame): Dataframe with the groups and a label column. It contains all columns of the original dataframe.
