Groups Splitting
The following classes and functions help to split batch data into experimental groups using different approaches.
Real-time Splitter availability
The real-time splitting tools are under development. This functionality is intended to be applied to batch data only.
- Splitter: Unit for creating experimental groups from batch data.
- load_from_config: Restore a Splitter class instance from a yaml config.
- split: Function wrapper around the Splitter class.
- class ambrosia.splitter.Splitter(dataframe=None, id_column=None, groups_size=None, test_group_ids=None, fit_columns=None, strat_columns=None)
Unit for creating experimental groups from batch data.
- Split your data into groups of selected size with respect to:
Stratification columns
Metric distance of objects in feature space
Set of passed ids
- Parameters:
- dataframe: PassedDataType, optional
Dataframe, or the name of a .csv table, containing the data to split into groups.
- id_column: IdColumnNameType, optional
Name of the id column used in hash split.
- groups_size: int, optional
Size of each resulting group.
- test_group_ids: PeriodColumnNamesType, optional
Ids of objects in the B (test) group. Used to pick an A (control) group after an experiment.
- fit_columns: PeriodColumnNamesType, optional
List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.
- strat_columns: PeriodColumnNamesType, optional
Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling
- Attributes:
- dataframe: PassedDataType
Pandas or Spark dataframe with the data to split.
- id_column: IdColumnNameType
Name of the id column used in hash split.
- groups_size: int
Size of each group.
- test_group_ids: PeriodColumnNamesType
Ids of objects in the B (test) group.
- fit_columns: PeriodColumnNamesType
List of column names used for metric split.
- strat_columns: PeriodColumnNamesType
Names of the stratification columns.
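To illustrate what stratification over strat_columns means, here is a minimal standard-library sketch. It is not the Splitter implementation: the function name stratified_split, the list-of-dicts input, and the round-robin dealing are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(rows, strat_columns, groups_number=2, seed=0):
    """Hypothetical helper: split rows so each stratum is spread evenly over groups."""
    rng = random.Random(seed)

    # Collect rows into strata keyed by their values in the stratification columns.
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in strat_columns)
        strata[key].append(row)

    # Shuffle inside each stratum, then deal its rows to the groups in turn.
    groups = [[] for _ in range(groups_number)]
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        for position, row in enumerate(members):
            groups[position % groups_number].append(row)
    return groups


rows = [
    {"id": i, "gender": "F" if i % 2 else "M", "age": "young" if i % 4 < 2 else "old"}
    for i in range(40)
]
group_a, group_b = stratified_split(rows, ["gender", "age"])
```

In this toy data every (gender, age) combination has ten rows, so each group receives five rows from each stratum and the groups end up with the same composition.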
Notes
Main methods for split:
- Simple:
Randomly chosen groups (via np.random.choice).
- Hash:
Identifiers are hashed and distributed into buckets, and the desired buckets are then selected to form the groups.
- Metric:
For a fixed reference group or a randomly selected one, other groups are selected using the nearest neighbor method (over the list of columns passed in the fit_columns parameter).
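The hash method described above can be sketched with the standard library alone. This is an illustration of the general idea (salted hash, bucket number, bucket-to-group mapping), not the library's code: the names hash_bucket and hash_split, the choice of SHA-256, the bucket count, and the fact that every id is assigned to some group are all assumptions.

```python
import hashlib


def hash_bucket(object_id, salt="", n_buckets=100):
    """Hypothetical helper: map an identifier to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(f"{salt}{object_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_split(ids, groups_number=2, salt="", n_buckets=100):
    """Hypothetical helper: assign every id to a group via its hash bucket."""
    labels = [chr(ord("A") + k) for k in range(groups_number)]
    groups = {label: [] for label in labels}
    for object_id in ids:
        bucket = hash_bucket(object_id, salt, n_buckets)
        # Buckets are dealt to groups in turn: 0 -> A, 1 -> B, 2 -> A, ...
        groups[labels[bucket % groups_number]].append(object_id)
    return groups


user_ids = [f"user_{i}" for i in range(1000)]
groups = hash_split(user_ids, salt="onboarding")
```

Because the assignment depends only on the id and the salt, rerunning the split reproduces the same groups, while a different salt yields an independent assignment for another experiment.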
Constructors:
>>> # Empty constructor
>>> splitter = Splitter()
>>> # Some data
>>> splitter = Splitter(dataframe=df,
...                     id_column='my_id_column',
...                     strat_columns=['gender', 'age'],
...                     test_group_ids=ids_for_B_group
... )
Setters:
>>> splitter.set_dataframe(dataframe)
>>> # You can pass string for pd.read_csv
>>> splitter.set_dataframe('name_of_table.csv')
>>> # Other setters
>>> splitter.set_group_size(1000)
>>> splitter.set_strat_columns(['age', 'region'])
Run:
>>> splitter.run(method='hash', groups_size=10000)
>>> splitter.run(method='metric',
...              test_group_ids=b_group,
...              id_column='id',
...              strat_columns=['age', 'city'],
...              fit_columns=['metric_history_column', 'other_metric'],
...              method_meric='fast',  # It is used as kwarg
...              norm='l2'  # It is used as kwarg
... )
Load from yaml config:
>>> config = '''
... !splitter  # <--- this is yaml tag (important!)
... groups_size: 1000
... id_column: id
... strat_columns:
...   - age
...   - country
... '''
>>> splitter = yaml.load(config, Loader=yaml.Loader)
>>> # Or use the implemented function
>>> splitter = load_from_config(config)
Examples
Our development team decided to add onboarding to the mobile app. Already knowing the required group size, we would like to select users for groups A and B respectively. Using the splitter class, this task could be done in the following way:
>>> splitter = Splitter(dataframe=dataframe)
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')
Suppose we now know that people of different ages and from several countries use our application, and we would like to take this into account during the split. Stratification handles this and requires only one additional parameter:
>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(groups_size=1000, method='hash', salt='onboarding')
If we have fixed users for the testing group, this can be specified as a parameter:
>>> splitter = Splitter(dataframe=dataframe, strat_columns=['age', 'country'])
>>> splitter.run(method='hash',
...              salt='onboarding',
...              test_group_ids=B_group_id
... )
- run(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, **kwargs)
Perform a split into groups with selected or saved parameters.
- Parameters:
- method: str
Split method, for example "hash".
- dataframe: PassedDataType, optional
Dataframe, or the name of a .csv table, containing the data to split into groups.
- id_column: IdColumnNameType, optional
Name of the id column used in hash split.
- groups_size: int, optional
Size of each resulting group.
- part_of_table: float, optional
Split factor (for group A) when the full dataframe is split. If not None, it overrides the groups_size parameter during the split.
- groups_number: int, default: 2
Number of groups to create.
- test_group_ids: PeriodColumnNamesType, optional
Ids of objects in the B (test) group. Used to pick an A (control) group after an experiment.
- strat_columns: PeriodColumnNamesType, optional
Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling
- salt: str, optional
Salt for hashing in hash split.
- fit_columns: PeriodColumnNamesType, optional
List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.
- **kwargs: Dict
Other keyword arguments.
- Returns:
- groups: pd.DataFrame
A dataframe with the groups and a label column. It contains all columns of the original dataframe.
- Other Parameters:
- threads: int, default: 1
Number of threads used for calculations.
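The role of fit_columns and test_group_ids in a metric split can be illustrated with a small nearest-neighbor sketch. It is a simplified stand-in, not the library's algorithm: the function pick_control_group, the dict-of-tuples input, the greedy one-to-one matching, and the fixed Euclidean (l2) distance are assumptions for illustration.

```python
import math


def pick_control_group(points, test_ids):
    """Hypothetical helper: for each test object, pick the closest unused object.

    points maps an object id to a tuple of coordinates, playing the role of
    the values found in the fit_columns of a dataframe.
    """
    test_set = set(test_ids)
    candidates = {
        obj_id: coords for obj_id, coords in points.items() if obj_id not in test_set
    }
    control = []
    for test_id in test_ids:
        if not candidates:
            break  # no objects left to match
        # Euclidean distance between the test object and every remaining candidate.
        nearest = min(
            candidates, key=lambda c: math.dist(points[test_id], candidates[c])
        )
        control.append(nearest)
        del candidates[nearest]  # each object may be matched only once
    return control


points = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (5.0, 5.0),
    "d": (5.2, 5.0),
    "e": (10.0, 10.0),
}
control_ids = pick_control_group(points, test_ids=["a", "c"])
```

Here the test objects "a" and "c" are matched with their closest neighbors "b" and "d", while the distant object "e" stays out of both groups.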
- ambrosia.splitter.load_from_config(yaml_config, loader=<class 'yaml.loader.Loader'>)
Restore a Splitter class instance from a yaml config.
For the yaml_config parameter you can pass the name of a config file, which must end with .yaml, for example "config.yaml". For loader you can choose SafeLoader instead of the default Loader.
- ambrosia.splitter.split(method, dataframe=None, id_column=None, groups_size=None, part_of_table=None, groups_number=2, test_group_ids=None, strat_columns=None, salt=None, fit_columns=None, threads=1, **kwargs)
Function wrapper around the Splitter class.
Used to create split groups from the dataframe.
Creates an instance of the Splitter class internally and executes its run method with the corresponding arguments.
- Parameters:
- method: str
Split method, for example "hash".
- dataframe: PassedDataType, optional
Dataframe, or the name of a .csv table, containing the data to split into groups.
- id_column: IdColumnNameType, optional
Name of the id column used in hash split.
- groups_size: int, optional
Size of each resulting group.
- part_of_table: float, optional
Split factor (for group A) when the full dataframe is split. If not None, it overrides the groups_size parameter during the split.
- groups_number: int, default: 2
Number of groups to create.
- test_group_ids: PeriodColumnNamesType, optional
Ids of objects in the B (test) group. Used to pick an A (control) group after an experiment.
- strat_columns: PeriodColumnNamesType, optional
Columns for stratification. https://en.wikipedia.org/wiki/Stratified_sampling
- salt: str, optional
Salt for hashing in hash split.
- fit_columns: PeriodColumnNamesType, optional
List of column names whose values are interpreted as coordinates of points in a multidimensional space during metric split.
- threads: int, default: 1
Number of threads used for calculations.
- **kwargs: Dict
Other keyword arguments.
- Returns:
- groups: pd.DataFrame
A dataframe with the groups and a label column. It contains all columns of the original dataframe.
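The interplay between groups_size and part_of_table described in the parameters above can be sketched as follows. This is only one plausible reading of that description and uses the standard library random module rather than np.random.choice. The function simple_split, its seed argument, and the rounding rule are assumptions, not the library's behavior.

```python
import random


def simple_split(ids, groups_size=None, part_of_table=None, seed=None):
    """Hypothetical helper: random two-group split by size or by table fraction."""
    if groups_size is None and part_of_table is None:
        raise ValueError("Pass either groups_size or part_of_table")

    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)

    if part_of_table is not None:
        # Full split: group A takes the given fraction, group B takes the rest.
        # The fraction takes precedence over groups_size when both are passed.
        size_a = round(len(shuffled) * part_of_table)
        return {"A": shuffled[:size_a], "B": shuffled[size_a:]}

    # Fixed-size split: two disjoint groups of groups_size objects each.
    return {
        "A": shuffled[:groups_size],
        "B": shuffled[groups_size:2 * groups_size],
    }


by_size = simple_split(range(100), groups_size=10, seed=42)
by_fraction = simple_split(range(100), groups_size=10, part_of_table=0.3, seed=42)
```

With groups_size alone, both groups contain 10 objects and the remaining objects are left out. With part_of_table=0.3, the whole table is divided into groups of 30 and 70 objects regardless of groups_size.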