Example of using the Splitter class to solve the group splitting problem¶
In this tutorial we will use Ambrosia splitting tools to create groups using different strategies.
The group splitting problem usually arises in A/B testing, once the experiment parameters have been designed and we need to form experimental groups from the objects under study.
Two different splitting paradigms¶
Approaches to splitting objects into groups fall into two paradigms: batch splitting and real-time splitting.
The rest of this tutorial reviews the tools for batch splitting tasks.
Note: Ambrosia currently supports only batch splitting. Real-time splitting tools are under development.
Let’s start the tutorial¶
[1]:
import sys, os
sys.path.insert(1, os.path.realpath(os.path.pardir))
[2]:
import pandas as pd
import numpy as np
import yaml
from ambrosia.splitter import Splitter, split, load_from_config
Your CPU supports instructions that this binary was not compiled to use: AVX2
For maximum performance, you can install NMSLIB from sources
pip install --no-binary :all: nmslib
[3]:
np.random.seed(42)
dataframe = pd.DataFrame({
    'm': np.zeros((200000, )),
    'a': np.random.normal(size=200000),
    'b': np.random.normal(size=200000)
})
dataframe['l'] = np.where(dataframe['a'] > 0, 1, 0)
dataframe['e'] = np.where(dataframe['b'] > 0, 1, 0)
dataframe['object_id'] = np.random.choice(dataframe.index,
                                          size=dataframe.shape[0],
                                          replace=False)
dataframe.head()
[3]:
| | m | a | b | l | e | object_id |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.496714 | 1.561841 | 1 | 1 | 63869 |
| 1 | 0.0 | -0.138264 | -0.094228 | 0 | 0 | 82374 |
| 2 | 0.0 | 0.647689 | -1.329536 | 1 | 0 | 162918 |
| 3 | 0.0 | 1.523030 | -1.388638 | 1 | 0 | 36327 |
| 4 | 0.0 | -0.234153 | -0.342651 | 0 | 0 | 91526 |
[4]:
dataframe.shape
[4]:
(200000, 6)
Now let’s get acquainted with the Splitter class.
The Splitter class is Ambrosia’s main tool for splitting objects into groups. It has one main public method, run(), which returns a table with groups of the desired size.
Let’s create an instance of the class and pass the generated dataframe to the constructor. This data plays the role of an abstract user database and will be used below to create groups with different methods. We also set id_column to "object_id", the column that contains the unique identifiers of the objects. If this column is not specified, the dataframe index is used as the identifier.
[5]:
splitter = Splitter(dataframe=dataframe, id_column='object_id')
As with the Designer class, we can instead pass this dataframe and other parameters later, as arguments to the run() method. The same holds for most of the parameters related directly to the experiment (errors, effects, and so on): either pass them to the constructor during initialization, in which case they become attributes of the created instance, or pass them later when calling run(). If a parameter is set in both places, the argument passed to the method takes precedence over the attribute value.
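The precedence rule can be illustrated with a minimal sketch. `SketchSplitter` is a hypothetical stand-in written for this illustration, not Ambrosia's `Splitter`; it only shows how an argument passed to `run()` overrides a value stored at construction time.

```python
class SketchSplitter:
    """Hypothetical stand-in that mimics only the parameter precedence rule."""

    def __init__(self, groups_size=None):
        # Value given at construction time is stored as an attribute.
        self.groups_size = groups_size

    def run(self, groups_size=None):
        # An explicit argument wins; otherwise fall back to the attribute.
        chosen = groups_size if groups_size is not None else self.groups_size
        if chosen is None:
            raise ValueError("groups_size must be set in the constructor or in run()")
        return chosen


sketch = SketchSplitter(groups_size=2000)
print(sketch.run())                  # uses the attribute
print(sketch.run(groups_size=500))   # the argument overrides the attribute
```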
Now let’s move on to review different ways to create groups that are implemented in the Splitter class.
Split approaches¶
Simple split¶
The first splitting strategy is called "simple". It is a basic, non-deterministic way of creating groups: each execution produces a new result.
To create such a split, we call the run() method with the corresponding value of the method parameter. We will create groups of 2000 objects each.
[6]:
splitter.run(method='simple', groups_size=2000)
[6]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 191060 | 0.0 | -0.230298 | 1.253592 | 0 | 1 | 136859 | A |
| 121593 | 0.0 | 1.974664 | -1.780258 | 1 | 0 | 164797 | A |
| 185512 | 0.0 | -1.254767 | -0.152099 | 0 | 0 | 49954 | A |
| 79803 | 0.0 | -1.572960 | -0.706893 | 0 | 0 | 154922 | A |
| 98956 | 0.0 | 0.714251 | 0.662607 | 1 | 1 | 99718 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 53739 | 0.0 | 0.070655 | 0.644952 | 1 | 1 | 62827 | B |
| 178405 | 0.0 | -0.423988 | -0.706336 | 0 | 0 | 103080 | B |
| 95002 | 0.0 | -0.105022 | 0.714893 | 0 | 1 | 155745 | B |
| 166811 | 0.0 | -1.459109 | 0.339358 | 0 | 1 | 157092 | B |
| 41369 | 0.0 | 0.721402 | -0.980647 | 1 | 0 | 113 | B |
4000 rows × 7 columns
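Conceptually, a split of this kind amounts to drawing objects at random without replacement and cutting the draw into groups. The sketch below uses only the standard library; `simple_split_sketch` and its `seed` parameter are assumptions made for illustration and do not describe how Ambrosia implements the "simple" method.

```python
import random


def simple_split_sketch(ids, groups_size, groups_number=2, seed=None):
    """Draw groups_size * groups_number ids at random and slice them into groups."""
    ids = list(ids)
    needed = groups_size * groups_number
    if needed > len(ids):
        raise ValueError("not enough objects for the requested groups")
    rng = random.Random(seed)
    chosen = rng.sample(ids, needed)  # sampling without replacement
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    return {
        label: chosen[i * groups_size:(i + 1) * groups_size]
        for i, label in enumerate(labels)
    }


groups = simple_split_sketch(range(200_000), groups_size=2000)
print({label: len(members) for label, members in groups.items()})
```

Without a fixed seed, every call returns different groups, which is the non-deterministic behaviour described above.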
Hash split¶
The "hash" method assigns objects to groups deterministically, based on a hash of each object's identifier.
To make the splits for each experiment unique, the "salt" parameter is used, which is appended to the end of the identifier of each object. The salt value can be, for example, the name of the experiment being performed.
You can read more about hash-based splitting on the web.
Let’s create a hash split and make sure the result is deterministic.
[7]:
groups_size = 5000
salt = 'example_dummy_experiment_2023'
Execute the split with a predefined salt value.
[8]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)
[8]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 14 | 0.0 | -1.724918 | -0.350186 | 0 | 0 | 90837 | A |
| 44 | 0.0 | -1.478522 | 0.166608 | 0 | 1 | 123196 | A |
| 64 | 0.0 | 0.812526 | 0.914659 | 1 | 1 | 117133 | A |
| 65 | 0.0 | 1.356240 | 0.731410 | 1 | 1 | 144787 | A |
| 161 | 0.0 | 0.787085 | -1.012367 | 1 | 0 | 186437 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199760 | 0.0 | 0.172396 | 0.844596 | 1 | 1 | 166816 | B |
| 199783 | 0.0 | -0.477993 | -0.899310 | 0 | 0 | 134168 | B |
| 199867 | 0.0 | -1.164759 | -0.649031 | 0 | 0 | 41423 | B |
| 199868 | 0.0 | 0.162848 | 2.835048 | 1 | 1 | 33513 | B |
| 199915 | 0.0 | 0.882166 | -1.665376 | 1 | 0 | 33638 | B |
10000 rows × 7 columns
Running the split again with the same salt value returns the same groups.
[9]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)
[9]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 14 | 0.0 | -1.724918 | -0.350186 | 0 | 0 | 90837 | A |
| 44 | 0.0 | -1.478522 | 0.166608 | 0 | 1 | 123196 | A |
| 64 | 0.0 | 0.812526 | 0.914659 | 1 | 1 | 117133 | A |
| 65 | 0.0 | 1.356240 | 0.731410 | 1 | 1 | 144787 | A |
| 161 | 0.0 | 0.787085 | -1.012367 | 1 | 0 | 186437 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199760 | 0.0 | 0.172396 | 0.844596 | 1 | 1 | 166816 | B |
| 199783 | 0.0 | -0.477993 | -0.899310 | 0 | 0 | 134168 | B |
| 199867 | 0.0 | -1.164759 | -0.649031 | 0 | 0 | 41423 | B |
| 199868 | 0.0 | 0.162848 | 2.835048 | 1 | 1 | 33513 | B |
| 199915 | 0.0 | 0.882166 | -1.665376 | 1 | 0 | 33638 | B |
10000 rows × 7 columns
The split result will be different if the salt is changed.
[10]:
splitter.run(method='hash', groups_size=groups_size, salt='salt')
[10]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 43 | 0.0 | -0.301104 | 0.440295 | 0 | 1 | 139147 | A |
| 192 | 0.0 | 0.214094 | 0.021427 | 1 | 1 | 231 | A |
| 226 | 0.0 | 0.064280 | 1.553626 | 1 | 1 | 139761 | A |
| 235 | 0.0 | 0.633919 | -1.277988 | 1 | 0 | 153281 | A |
| 285 | 0.0 | -1.952088 | 1.610653 | 0 | 1 | 36040 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199862 | 0.0 | 2.035899 | 0.452816 | 1 | 1 | 34064 | B |
| 199949 | 0.0 | 0.438721 | -0.592572 | 1 | 0 | 99013 | B |
| 199970 | 0.0 | 0.868163 | 0.463027 | 1 | 1 | 53783 | B |
| 199991 | 0.0 | 0.383196 | 0.230814 | 1 | 1 | 199822 | B |
| 199996 | 0.0 | 0.565654 | -2.316381 | 1 | 0 | 147356 | B |
10000 rows × 7 columns
If no salt argument is passed, a random value will be generated during the split.
The hash splitting method is fast and convenient, and it is the recommended default.
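The idea of a salted hash split can be sketched with the standard library. This is an illustration under stated assumptions, not Ambrosia's implementation: the choice of SHA-256, the rule of ordering objects by hash and slicing the first ones into groups, and the names `salted_hash` and `hash_split_sketch` are all hypothetical.

```python
import hashlib


def salted_hash(object_id, salt):
    """Hash the identifier with the salt appended to its end."""
    return hashlib.sha256(f"{object_id}{salt}".encode("utf-8")).hexdigest()


def hash_split_sketch(ids, groups_size, salt, groups_number=2):
    """Order objects by salted hash and cut consecutive slices into groups."""
    ids = list(ids)
    if groups_size * groups_number > len(ids):
        raise ValueError("not enough objects for the requested groups")
    # The ordering depends only on the ids and the salt, so it is reproducible.
    ordered = sorted(ids, key=lambda object_id: salted_hash(object_id, salt))
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    return {
        label: ordered[i * groups_size:(i + 1) * groups_size]
        for i, label in enumerate(labels)
    }


ids = range(10_000)
first = hash_split_sketch(ids, 500, "example_dummy_experiment_2023")
second = hash_split_sketch(ids, 500, "example_dummy_experiment_2023")
other = hash_split_sketch(ids, 500, "salt")
print(first == second)  # same salt, same groups
print(first == other)   # compare against a split with a different salt
```

Because nothing random is involved once the salt is fixed, the split can be recomputed at any time from the identifiers alone.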
Metric split¶
For some tasks, it is very useful to find similar objects and distribute them across groups. For example, we can choose a random object for group A, find its closest neighbor in the general pool by some metric, and send that neighbor to group B. This makes the groups more similar and increases the power of some statistical tests, which is especially valuable for small groups.
This approach is implemented in the "metric" split method. Using the fit_columns parameter, we can specify a set of features on which pairs of similar objects are selected by minimizing the Euclidean distance and then distributed between the groups.
We will create two groups using a metric split based on the two features a and b. A metric split needs sufficient computational resources, because nearest neighbors must be found for a set of points equal in size to one group.
[11]:
metric_split = splitter.run(method='metric', groups_size=groups_size, fit_columns=['a', 'b'])
[12]:
metric_split
[12]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 199994 | 0.0 | -0.590488 | -0.518154 | 0 | 0 | 12123 | A |
| 80866 | 0.0 | 0.436653 | -0.458537 | 1 | 0 | 11556 | A |
| 128000 | 0.0 | 0.448011 | -0.555275 | 1 | 0 | 71871 | A |
| 95833 | 0.0 | 0.514975 | 1.088812 | 1 | 1 | 149913 | A |
| 41929 | 0.0 | -1.537990 | -0.270142 | 0 | 0 | 54406 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 191916 | 0.0 | -1.306423 | -0.777014 | 0 | 0 | 13089 | B |
| 57853 | 0.0 | -0.490125 | 1.742080 | 0 | 1 | 159390 | B |
| 189321 | 0.0 | -1.759917 | 0.181625 | 0 | 1 | 153730 | B |
| 92099 | 0.0 | -0.972475 | 0.624865 | 0 | 1 | 13100 | B |
| 137630 | 0.0 | -0.175013 | 0.633727 | 0 | 1 | 23283 | B |
10000 rows × 7 columns
Currently, pairs of similar objects occupy the same positions in the group slices, and this is the only way to find them if you want to inspect them individually.
[13]:
metric_split.query("group == 'A'").iloc[0]
[13]:
m 0.0
a -0.590488
b -0.518154
l 0
e 0
object_id 12123
group A
Name: 199994, dtype: object
[14]:
metric_split.query("group == 'B'").iloc[0]
[14]:
m 0.0
a -0.59219
b -0.519087
l 0
e 0
object_id 145596
group B
Name: 111639, dtype: object
Note: Metric split creates pairs (or sets in the case of multiple groups) of dependent objects between groups. This leads to the need to use paired statistical tests.
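The pairing idea behind a metric split can be sketched as a greedy nearest-neighbor search. The function `metric_split_sketch` is a hypothetical brute-force illustration using the standard library; it is not Ambrosia's algorithm, which may rely on approximate nearest-neighbor search for speed.

```python
import math
import random


def metric_split_sketch(features, groups_size, seed=None):
    """Pick a random anchor for A, send its nearest remaining neighbor to B."""
    pool = dict(features)
    if 2 * groups_size > len(pool):
        raise ValueError("not enough objects to build the pairs")
    rng = random.Random(seed)
    group_a, group_b = [], []
    for _ in range(groups_size):
        anchor_id = rng.choice(list(pool))
        anchor = pool.pop(anchor_id)
        # Brute-force search for the closest remaining point (Euclidean distance).
        nearest_id = min(pool, key=lambda other_id: math.dist(anchor, pool[other_id]))
        pool.pop(nearest_id)
        group_a.append(anchor_id)
        group_b.append(nearest_id)
    return group_a, group_b


data_rng = random.Random(42)
features = {
    object_id: (data_rng.gauss(0, 1), data_rng.gauss(0, 1))
    for object_id in range(1000)
}
group_a, group_b = metric_split_sketch(features, groups_size=50, seed=0)
# Objects at the same position in both lists form a pair.
for a_id, b_id in list(zip(group_a, group_b))[:3]:
    print(a_id, features[a_id], b_id, features[b_id])
```

As in the output above, paired objects sit at the same position in their groups, so the pairs can be recovered by zipping the two lists.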
Stratification¶
We can also sample groups with stratification.
Stratification makes the groups more homogeneous and more similar to the general population from which they were sampled, and it reduces the variance of metrics within the groups. This can be especially useful for small groups.
To demonstrate this, let’s choose a binary column for stratification, pass it to the strat_columns parameter, and compare the share of the feature in the groups with and without stratification.
[15]:
groups_size = 500
[16]:
stratified_split = splitter.run(method='simple', groups_size=groups_size, strat_columns=['l'])
non_stratified_split = splitter.run(method='simple', groups_size=groups_size)
[17]:
print(f'Initial share of strata: {dataframe["l"].mean() * 100:.1f}%')
Initial share of strata: 50.1%
Share of strata inside the splits
[18]:
print(
    f'Share of strata in groups with stratification: {stratified_split["l"].mean() * 100:.1f}%'
)
print(
    f'Share of strata in groups without stratification: {non_stratified_split["l"].mean() * 100:.1f}%'
)
Share of strata in groups with stratification: 50.1%
Share of strata in groups without stratification: 51.3%
Share of strata inside the groups
[19]:
print('Share of strata in each group with stratification\n',
      np.round(stratified_split.groupby('group')['l'].mean(), 3))
print('\n\nShare of strata in each group without stratification\n',
      np.round(non_stratified_split.groupby('group')['l'].mean(), 3))
Share of strata in each group with stratification
group
A 0.500
B 0.502
Name: l, dtype: float64
Share of strata in each group without stratification
group
A 0.514
B 0.512
Name: l, dtype: float64
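Why stratification keeps the shares close to the population can be seen in a small sketch of proportional allocation: each stratum contributes a number of objects proportional to its size. `stratified_sample_sketch` is a hypothetical helper written for this illustration and does not reproduce Ambrosia's internals.

```python
import random
from collections import defaultdict


def stratified_sample_sketch(strata_by_id, sample_size, seed=None):
    """Sample from every stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for object_id, stratum in strata_by_id.items():
        members[stratum].append(object_id)
    total = len(strata_by_id)
    sample = []
    for stratum, ids in members.items():
        # Rounding may shift the total by a few objects when shares are uneven.
        quota = round(sample_size * len(ids) / total)
        sample.extend(rng.sample(ids, quota))
    return sample


# Toy population: 30% of objects belong to stratum 1, the rest to stratum 0.
strata = {object_id: int(object_id % 10 < 3) for object_id in range(10_000)}
sample = stratified_sample_sketch(strata, sample_size=500, seed=0)
share = sum(strata[object_id] for object_id in sample) / len(sample)
print(f"Share of stratum 1 in the sample: {share:.3f}")
```

With proportional quotas the share in the sample matches the population share up to rounding, whereas a plain random draw fluctuates around it.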
Multigroup split¶
Often two experimental groups are not enough, for example when we want to test the performance of several new recommender system algorithms. In that scenario one may want to make an A/B/C/... split.
In Ambrosia, all the functions and methods above are generalized to splitting into several groups, and the number of groups is controlled with the groups_number parameter.
Let’s create 3 groups using the metric split.
[20]:
metric_multisplit = splitter.run(method='metric',
                                 groups_size=groups_size,
                                 fit_columns=['a', 'b'],
                                 groups_number=3)
[21]:
metric_multisplit.query("group == 'A'").iloc[0]
[21]:
m 0.0
a 0.612142
b 0.115751
l 1
e 1
object_id 182967
group A
Name: 74537, dtype: object
[22]:
metric_multisplit.query("group == 'B'").iloc[0]
[22]:
m 0.0
a 0.614548
b 0.116243
l 1
e 1
object_id 197300
group B
Name: 118068, dtype: object
[23]:
metric_multisplit.query("group == 'C'").iloc[0]
[23]:
m 0.0
a 0.613116
b 0.115762
l 1
e 1
object_id 77456
group C
Name: 150076, dtype: object
Now let’s create 10 groups using the hash method.
[24]:
hash_multisplit = splitter.run(method='hash',
                               groups_size=1000,
                               groups_number=10)
[25]:
hash_multisplit
[25]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 16 | 0.0 | -1.012831 | 0.747465 | 0 | 1 | 155174 | A |
| 245 | 0.0 | -0.334501 | 0.615637 | 0 | 1 | 199630 | A |
| 710 | 0.0 | 0.211017 | 1.701546 | 1 | 1 | 156984 | A |
| 920 | 0.0 | 1.073632 | 0.596244 | 1 | 1 | 78818 | A |
| 1159 | 0.0 | -0.324831 | 0.547028 | 0 | 1 | 40862 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 198886 | 0.0 | 2.243574 | 0.782718 | 1 | 1 | 134662 | J |
| 199060 | 0.0 | 1.154370 | 0.258987 | 1 | 1 | 195179 | J |
| 199812 | 0.0 | -2.566508 | 0.553087 | 0 | 1 | 165028 | J |
| 199946 | 0.0 | 0.934797 | -1.368926 | 1 | 0 | 25702 | J |
| 199960 | 0.0 | 0.730843 | -0.735387 | 1 | 0 | 35482 | J |
10000 rows × 7 columns
[26]:
hash_multisplit.group.value_counts()
[26]:
A 1000
B 1000
C 1000
D 1000
E 1000
F 1000
G 1000
H 1000
I 1000
J 1000
Name: group, dtype: int64
Splitting the full table¶
Sometimes an entire table needs to be divided into groups. At the moment, Ambrosia can split a data frame into 2 groups using the part_of_table parameter.
We will split the passed data frame with the hash method so that group A receives 1/3 of the table and group B the remaining 2/3.
[27]:
part_of_table = 1/3
fractional_hash_split = splitter.run(method='hash',
                                     part_of_table=part_of_table,
                                     salt='fractional_split')
[28]:
fractional_hash_split
[28]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 1 | 0.0 | -0.138264 | -0.094228 | 0 | 0 | 82374 | A |
| 3 | 0.0 | 1.523030 | -1.388638 | 1 | 0 | 36327 | A |
| 5 | 0.0 | -0.234137 | -1.580520 | 0 | 0 | 63304 | A |
| 6 | 0.0 | 1.579213 | 0.587148 | 1 | 1 | 187546 | A |
| 15 | 0.0 | -0.562288 | -1.362157 | 0 | 0 | 133839 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199991 | 0.0 | 0.383196 | 0.230814 | 1 | 1 | 199822 | B |
| 199994 | 0.0 | -0.590488 | -0.518154 | 0 | 0 | 12123 | B |
| 199996 | 0.0 | 0.565654 | -2.316381 | 1 | 0 | 147356 | B |
| 199998 | 0.0 | 0.855673 | 0.462531 | 1 | 1 | 132270 | B |
| 199999 | 0.0 | -1.064948 | -0.137357 | 0 | 0 | 86886 | B |
200000 rows × 7 columns
[29]:
fractional_hash_split.group.value_counts(normalize=True)
[29]:
B 0.666665
A 0.333335
Name: group, dtype: float64
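A fractional split of a whole table can be sketched by ordering all objects by a salted hash and cutting the ordering at the requested fraction. This is an illustration under assumptions (SHA-256, sort-and-cut, rounding of the group size, the name `fractional_hash_split_sketch`), not a description of Ambrosia's code.

```python
import hashlib


def fractional_hash_split_sketch(ids, part_of_table, salt):
    """Send the first part_of_table share of the hash-ordered ids to A, the rest to B."""
    ids = list(ids)
    ordered = sorted(
        ids,
        key=lambda object_id: hashlib.sha256(f"{object_id}{salt}".encode("utf-8")).hexdigest(),
    )
    size_a = round(len(ids) * part_of_table)  # size of group A, rounded to whole objects
    return {"A": ordered[:size_a], "B": ordered[size_a:]}


split_sketch = fractional_hash_split_sketch(range(30_000), 1 / 3, "fractional_split")
print({label: len(members) / 30_000 for label, members in split_sketch.items()})
```

Under the same rounding rule, a third of a 200,000-row table is 66,667 objects, a share of 0.333335, which is consistent with the shares printed above.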
Selection of an existing group for a test group¶
Another scenario that sometimes occurs in A/B testing is the post-generation of a control group from the pool of available objects that were not affected by the treatment.
Although post-analysis of experiments and post-generation of samples call for considerable care, Ambrosia can create a control group for an existing test group using all the methods above.
To do this, it is enough to pass a list of identifiers from the test group to the test_group_ids parameter.
[30]:
np.random.seed(42)
groups_size = 10000
test_ids = np.random.choice(dataframe.object_id, size=groups_size, replace=False)
[31]:
post_hash_split = splitter.run(method='hash',
                               groups_size=groups_size,
                               test_group_ids=test_ids,
                               salt='post-split')
[32]:
post_hash_split
[32]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 24 | 0.0 | -0.544383 | 0.242347 | 0 | 1 | 37916 | A |
| 94 | 0.0 | -0.392108 | 0.026810 | 0 | 1 | 12345 | A |
| 136 | 0.0 | -0.783253 | -1.911507 | 0 | 0 | 94132 | A |
| 152 | 0.0 | -0.680025 | -0.649503 | 0 | 0 | 87931 | A |
| 155 | 0.0 | -0.714351 | 0.266708 | 0 | 1 | 122498 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 199945 | 0.0 | 2.866851 | -0.048362 | 1 | 0 | 146867 | B |
| 199963 | 0.0 | -0.935605 | 0.169802 | 0 | 1 | 116839 | B |
| 199976 | 0.0 | -0.811208 | -0.931313 | 0 | 0 | 18652 | B |
| 199977 | 0.0 | 0.547009 | 0.330221 | 1 | 1 | 180593 | B |
| 199997 | 0.0 | 0.160020 | 0.831556 | 1 | 1 | 12296 | B |
20000 rows × 7 columns
Check that all objects with test ids are in group B and none of them are in group A.
[33]:
np.isin(test_ids, post_hash_split.query("group == 'A'").object_id).sum()
[33]:
0
[34]:
np.isin(test_ids, post_hash_split.query("group == 'B'").object_id).sum()
[34]:
10000
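The logic of building a control group for an existing test group can be sketched as follows: exclude the test identifiers from the pool, then draw a control group of the same size from what remains. `control_for_test_sketch` is a hypothetical helper that uses random sampling for brevity; it is not Ambrosia's implementation.

```python
import random


def control_for_test_sketch(all_ids, test_ids, seed=None):
    """Build control group A from objects outside the existing test group B."""
    test = set(test_ids)
    # Only objects that are not in the test group may enter the control group.
    candidates = [object_id for object_id in all_ids if object_id not in test]
    if len(candidates) < len(test):
        raise ValueError("not enough untreated objects for a control group")
    rng = random.Random(seed)
    control = rng.sample(candidates, len(test))
    return {"A": control, "B": sorted(test)}


pool_rng = random.Random(42)
all_ids = list(range(5_000))
existing_test_ids = pool_rng.sample(all_ids, 500)
post_split = control_for_test_sketch(all_ids, existing_test_ids, seed=0)
print(len(post_split["A"]), len(post_split["B"]))
print(len(set(post_split["A"]) & set(existing_test_ids)))  # overlap between control and test
```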
Storable configuration¶
The load_from_config function restores a Splitter instance directly from a .yaml file.
Let’s create an instance with the parameters we want to save in a file.
[35]:
store_path = '_examples_configs/splitter_config.yaml'
[36]:
storable_splitter = Splitter(id_column='object_id',
                             groups_size=322,
                             strat_columns=['l', 'e'])
[37]:
storable_splitter.__getstate__()
[37]:
{'id_column': 'object_id',
'groups_size': 322,
'fit_columns': None,
'strat_columns': ['l', 'e']}
Save as the .yaml file
[38]:
with open(store_path, "w") as outfile:
    yaml.dump(storable_splitter, outfile, default_flow_style=False)
Load from the file
[39]:
loaded_splitter = load_from_config(store_path)
[40]:
loaded_splitter.__getstate__()
[40]:
{'id_column': 'object_id',
'groups_size': 322,
'fit_columns': None,
'strat_columns': ['l', 'e']}
Set the dataframe and run a split.
[41]:
loaded_splitter.set_dataframe(dataframe)
[42]:
loaded_splitter.run(method='hash', salt='from_yaml')
[42]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 2198 | 0.0 | -0.641487 | -0.658645 | 0 | 0 | 20742 | A |
| 2642 | 0.0 | -0.847634 | -0.043161 | 0 | 0 | 120722 | A |
| 7596 | 0.0 | -1.572989 | -1.090749 | 0 | 0 | 24930 | A |
| 7688 | 0.0 | -0.233898 | -1.196993 | 0 | 0 | 5767 | A |
| 10737 | 0.0 | -1.000271 | -0.298924 | 0 | 0 | 187434 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 197562 | 0.0 | 2.832918 | 0.327783 | 1 | 1 | 41784 | B |
| 132857 | 0.0 | 0.286866 | -0.218729 | 1 | 0 | 11805 | A |
| 36947 | 0.0 | -0.893407 | -0.847359 | 0 | 0 | 175180 | A |
| 27596 | 0.0 | 1.459311 | 1.463153 | 1 | 1 | 65061 | B |
| 162879 | 0.0 | 0.702267 | -0.490351 | 1 | 0 | 35838 | B |
644 rows × 7 columns
Stand-alone split function¶
Ambrosia also provides a stand-alone split function that replicates the behavior of the Splitter class and can be used for the same split tasks.
[43]:
split(method='simple',
      dataframe=dataframe,
      id_column='object_id',
      groups_size=1000)
[43]:
| | m | a | b | l | e | object_id | group |
|---|---|---|---|---|---|---|---|
| 196900 | 0.0 | 0.050390 | 0.819410 | 1 | 1 | 14646 | A |
| 43323 | 0.0 | -3.393915 | -0.281103 | 0 | 0 | 36721 | A |
| 171194 | 0.0 | -0.177053 | 0.248580 | 0 | 1 | 156469 | A |
| 185518 | 0.0 | 0.757116 | 0.366774 | 1 | 1 | 53901 | A |
| 10289 | 0.0 | -0.751167 | 2.058094 | 0 | 1 | 134394 | A |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 69994 | 0.0 | -1.329803 | -0.023193 | 0 | 0 | 24351 | B |
| 114139 | 0.0 | 0.714640 | 1.330547 | 1 | 1 | 99421 | B |
| 142263 | 0.0 | -0.590689 | -0.951325 | 0 | 0 | 97706 | B |
| 86716 | 0.0 | -0.601845 | 0.501994 | 0 | 1 | 83325 | B |
| 108454 | 0.0 | 0.835601 | 0.454933 | 1 | 1 | 18123 | B |
2000 rows × 7 columns
Learn more¶
More information about splitting groups with Ambrosia is available here:
Splitter class documentation
An example of splitting groups from a Spark DataFrame (currently has limited functionality)