Groups Splitting

The following classes and functions help split batch data into experimental groups using different approaches.

Real-time Splitter availability

Real-time splitting tools are still under development. The functionality described here is intended for batch data only.

  • Splitter: Unit for creating experimental groups from batch data.

  • load_from_config: Restore a Splitter class instance from a yaml config.

  • split: Function wrapper around the Splitter class.


class ambrosia.splitter.Splitter(dataframe=None, id_column=None, groups_size=None, test_group_ids=None, fit_columns=None, strat_columns=None)[source]

Unit for creating experimental groups from batch data.

Split your data into groups of selected size with respect to:
  • Stratification columns

  • Metric distance of objects in feature space

  • Set of passed ids

Parameters:
dataframe : PassedDataType, optional

Dataframe or string name of a .csv table containing the data used for the group split.

id_column : IdColumnNameType, optional

Name of the id column used in hash split.

groups_size : int, optional

Size of each split group.

test_group_ids : PeriodColumnNamesType, optional

Ids of objects in the B (test) group. Used for tasks where the A (control) group is picked after the experiment.

fit_columns : PeriodColumnNamesType, optional

List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.

strat_columns : PeriodColumnNamesType, optional

Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling

Attributes:
dataframe : PassedDataType

Pandas or Spark dataframe with the data to split.

id_column : IdColumnNameType

Name of the id column used in hash split.

groups_size : int

Size of each split group.

test_group_ids : PeriodColumnNamesType

Ids of objects in the B (test) group.

fit_columns : PeriodColumnNamesType

List of column names used for metric split.

strat_columns : PeriodColumnNamesType

Names of the stratification columns.

Notes

Main methods for split:

Simple:
  • Randomly chosen groups (via np.random.choice).

Hash:
  • Identifiers are hashed and distributed into buckets, then the desired buckets are selected to form the groups.

Metric:
  • Starting from a fixed or randomly selected reference group, the other groups are selected with the nearest neighbor method, using the columns passed in the fit_columns parameter.
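The hash method described above can be illustrated with a minimal, standard-library sketch. It is not ambrosia's implementation: the helper names `bucket_of` and `hash_split`, the choice of MD5, and the default of 100 buckets are all assumptions made for illustration.

```python
import hashlib


def bucket_of(object_id, salt, n_buckets=100):
    """Map an id to a bucket in [0, n_buckets) using a salted MD5 hash."""
    digest = hashlib.md5(f"{salt}{object_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_split(ids, salt, n_buckets=100, groups_number=2):
    """Assign ids to groups by giving each group a disjoint range of buckets."""
    buckets_per_group = n_buckets // groups_number
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    groups = {label: [] for label in labels}
    for object_id in ids:
        index = bucket_of(object_id, salt, n_buckets) // buckets_per_group
        if index < groups_number:  # ids in leftover buckets stay unassigned
            groups[labels[index]].append(object_id)
    return groups


# Hypothetical ids; the salt makes the assignment reproducible.
ids = [f"user_{i}" for i in range(1000)]
groups = hash_split(ids, salt="onboarding")
repeat = hash_split(ids, salt="onboarding")
```

Because the assignment depends only on the id and the salt, rerunning with the same salt reproduces the same groups, while a different salt generally produces a different assignment.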

Constructors:

>>> # Empty constructor
>>> splitter = Splitter()
>>> # Some data
>>> splitter = Splitter(dataframe=df,
...                     id_column='my_id_column',
...                     strat_columns=['gender', 'age'],
...                     test_group_ids=ids_for_B_group
... )

Setters:

>>> splitter.set_dataframe(dataframe)
>>> # You can pass string for pd.read_csv
>>> splitter.set_dataframe('name_of_table.csv')
>>> # Other setters
>>> splitter.set_group_size(1000)
>>> splitter.set_strat_columns(['age', 'region'])

Run:

>>> splitter.run(method='hash', groups_size=10000)
>>> splitter.run(method='metric',
...              test_group_ids=b_group,
...              id_column='id',
...              strat_columns=['age', 'city'],
...              fit_columns=['metric_history_column', 'other_metric'],
...              method_metric='fast',  # Passed through **kwargs
...              norm='l2'  # Passed through **kwargs
... )
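
To give an intuition for the metric method with a fixed test group, here is a small greedy nearest-neighbor sketch in plain Python. It only illustrates the idea; the function `nearest_control` and the toy data are hypothetical, only the L2 distance is covered, and the library's actual matching strategy may differ.

```python
import math


def nearest_control(rows, test_ids, fit_columns):
    """For each test object, pick the closest unused object (L2 distance)."""
    test_ids = set(test_ids)
    candidates = {key: row for key, row in rows.items() if key not in test_ids}
    control = {}
    for test_id in sorted(test_ids):
        point = [rows[test_id][column] for column in fit_columns]
        best = min(
            candidates,
            key=lambda cid: math.dist(
                point, [candidates[cid][column] for column in fit_columns]
            ),
        )
        control[test_id] = best
        del candidates[best]  # each control object is used at most once
    return control


# Toy data keyed by id; the column names mirror the example above.
rows = {
    1: {"metric_history_column": 10.0, "other_metric": 1.0},
    2: {"metric_history_column": 10.5, "other_metric": 1.1},
    3: {"metric_history_column": 50.0, "other_metric": 5.0},
    4: {"metric_history_column": 49.0, "other_metric": 5.2},
    5: {"metric_history_column": 30.0, "other_metric": 3.0},
}
pairs = nearest_control(
    rows, test_ids=[1, 3], fit_columns=["metric_history_column", "other_metric"]
)
```

Here object 2 is matched to test object 1 and object 4 to test object 3, since they are the closest points in the two-dimensional feature space.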

Load from yaml config:

>>> import yaml
>>> config = '''
... !splitter # <--- this is a yaml tag (important!)
...     groups_size:
...         1000
...     id_column:
...         id
...     strat_columns:
...         - age
...         - country
... '''
>>> splitter = yaml.load(config, Loader=yaml.Loader)
>>> # Or use the implemented function
>>> splitter = load_from_config(config)

Examples

Our development team decided to add onboarding to the mobile app. Already knowing the required group size, we would like to select users for groups A and B respectively. Using the splitter class, this task could be done in the following way:

>>> splitter = Splitter(dataframe=dataframe)
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')

Suppose we now know that people of different ages and from several countries use our application, and we would like to take this into account during the split. Stratification handles this and requires only one additional parameter:

>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')
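
What stratification does conceptually can be sketched as follows: the data is partitioned by the values of the stratification columns, and each stratum contributes to every group in proportion to its share of the data. This is an assumption-laden illustration, not the library's code; `stratified_split`, the `seed` argument, and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, strat_columns, groups_size, seed=0):
    """Draw groups A and B so that each stratum contributes proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[column] for column in strat_columns)].append(row)
    total = len(rows)
    groups = {"A": [], "B": []}
    for members in strata.values():
        # Per-group quota proportional to the stratum's share of the data.
        quota = round(groups_size * len(members) / total)
        picked = rng.sample(members, 2 * quota)
        groups["A"].extend(picked[:quota])
        groups["B"].extend(picked[quota:])
    return groups


# Toy data: 400 users spread over three (age, country) strata.
rows = [
    {"id": i, "age": "18-30" if i % 4 else "30+", "country": "DE" if i % 2 else "FR"}
    for i in range(400)
]
groups = stratified_split(rows, ["age", "country"], groups_size=40)
```

Both groups end up with the same number of users from every (age, country) combination, so the two strata-level distributions match by construction.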

If we have fixed users for the testing group, this can be specified as a parameter:

>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(method='hash',
...              salt='onboarding',
...              test_group_ids=B_group_id
... )

run(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, **kwargs)[source]

Perform a split into groups with selected or saved parameters.

Parameters:
method : str

Split method, for example "hash".

dataframe : PassedDataType, optional

Dataframe or string name of a .csv table containing the data used for the group split.

id_column : IdColumnNameType, optional

Name of the id column used in hash split.

groups_size : int, optional

Size of each split group.

part_of_table : float, optional

Split factor (for group A) used when the full dataframe is split. If not None, it overrides the groups_size parameter during the split.

groups_number : int, default: 2

Number of groups to split into.

test_group_ids : PeriodColumnNamesType, optional

Ids of objects in the B (test) group. Used for tasks where the A (control) group is picked after the experiment.

strat_columns : PeriodColumnNamesType, optional

Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling

salt : str, optional

Salt for hashing in hash split.

fit_columns : PeriodColumnNamesType, optional

List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.

**kwargs : Dict

Other keyword arguments.

Returns:
groups : pd.DataFrame

A dataframe with the groups and a label column. It contains all columns of the original dataframe.

Other Parameters:
threads : int, default: 1

Number of threads used for calculations.
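
The precedence between part_of_table and groups_size can be pictured with a small sketch. It rests on assumptions: two groups only, part_of_table read as the fraction of rows assigned to group A with the remainder going to group B, and a hypothetical helper named `resolve_group_sizes`. The library may resolve sizes differently.

```python
def resolve_group_sizes(n_rows, groups_size=None, part_of_table=None):
    """Return (size_a, size_b); part_of_table takes precedence over groups_size."""
    if part_of_table is not None:
        if not 0 < part_of_table < 1:
            raise ValueError("part_of_table must be in (0, 1)")
        size_a = round(n_rows * part_of_table)
        return size_a, n_rows - size_a  # the whole table is distributed
    if groups_size is None:
        raise ValueError("pass groups_size or part_of_table")
    if 2 * groups_size > n_rows:
        raise ValueError("not enough rows for two groups")
    return groups_size, groups_size


# groups_size alone gives two equal groups.
sizes_fixed = resolve_group_sizes(10_000, groups_size=1000)
# When both are passed, part_of_table wins and all rows are used.
sizes_full = resolve_group_sizes(10_000, groups_size=1000, part_of_table=0.25)
```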

ambrosia.splitter.load_from_config(yaml_config, loader=<class 'yaml.loader.Loader'>)[source]

Restore a Splitter class instance from a yaml config.

For the yaml_config parameter you can pass the name of a config file, which must end with .yaml, for example: “config.yaml”. For loader you can choose SafeLoader.

ambrosia.splitter.split(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, threads=1, **kwargs)[source]

Function wrapper around the Splitter class.

Used to create split groups from the dataframe.

Creates an instance of the Splitter class internally and executes its run method with the corresponding arguments.

Parameters:
method : str

Split method, for example "hash".

dataframe : PassedDataType, optional

Dataframe or string name of a .csv table containing the data used for the group split.

id_column : IdColumnNameType, optional

Name of the id column used in hash split.

groups_size : int, optional

Size of each split group.

part_of_table : float, optional

Split factor (for group A) used when the full dataframe is split. If not None, it overrides the groups_size parameter during the split.

groups_number : int, default: 2

Number of groups to split into.

test_group_ids : PeriodColumnNamesType, optional

Ids of objects in the B (test) group. Used for tasks where the A (control) group is picked after the experiment.

strat_columns : PeriodColumnNamesType, optional

Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling

salt : str, optional

Salt for hashing in hash split.

fit_columns : PeriodColumnNamesType, optional

List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.

threads : int, default: 1

Number of threads used for calculations.

**kwargs : Dict

Other keyword arguments.

Returns:
groups : pd.DataFrame

A dataframe with the groups and a label column. It contains all columns of the original dataframe.
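
The relationship between the split function and the Splitter class follows a common wrapper pattern, sketched below with toy stand-ins. `ToySplitter`, `toy_split`, and the "head" method are hypothetical and exist only to show the delegation and the way arguments passed to run override saved values; they are not part of ambrosia.

```python
class ToySplitter:
    """Hypothetical stand-in that stores defaults and lets run() override them."""

    def __init__(self, ids=None, groups_size=None):
        self.ids = ids
        self.groups_size = groups_size

    def run(self, method, ids=None, groups_size=None):
        # Arguments passed to run() win over values saved in the constructor.
        ids = self.ids if ids is None else ids
        groups_size = self.groups_size if groups_size is None else groups_size
        if method != "head":  # toy method: take the first ids in sorted order
            raise ValueError(f"unknown method: {method}")
        ordered = sorted(ids)
        return {
            "A": ordered[:groups_size],
            "B": ordered[groups_size:2 * groups_size],
        }


def toy_split(method, ids=None, groups_size=None):
    """Function wrapper: build a fresh instance and delegate to run()."""
    return ToySplitter().run(method, ids=ids, groups_size=groups_size)


via_class = ToySplitter(ids=range(10), groups_size=3).run(method="head")
via_function = toy_split("head", ids=range(10), groups_size=3)
```

Both call styles give the same result; the function form is convenient for one-off splits, while the class form keeps parameters around for repeated runs.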

Examples of using groups splitting tools