Example of using the Splitter class to solve the group splitting problem

In this tutorial we will use Ambrosia splitting tools to create a number of groups using different strategies.

The group splitting problem usually appears in A/B testing, once the experiment parameters have been designed and we need to create experimental groups from the objects under study.

Two different splitting paradigms

Broadly, there are two approaches to splitting objects into groups: batch and real-time.

In batch splitting, we precalculate the contents of the experimental groups using, for example, a common database of research objects.
In real-time splitting, a tool distributes objects into groups as they arrive, although it may also use some precalculated information.

The rest of this tutorial covers the tools for batch splitting tasks.

Note: Ambrosia currently supports only batch splitting. Real-time splitting tools are under development.

Let’s start the tutorial

[1]:
import sys, os
sys.path.insert(1, os.path.realpath(os.path.pardir))
[2]:
import pandas as pd
import numpy as np

import yaml

from ambrosia.splitter import Splitter, split, load_from_config
Generate synthetic data with a number of different columns.
We will create 200000 objects, each with a unique id and some numerical features.
[3]:
np.random.seed(42)

dataframe = pd.DataFrame({
    'm': np.zeros((200000, )),
    'a': np.random.normal(size=200000),
    'b': np.random.normal(size=200000)
})
dataframe['l'] = np.where(dataframe['a'] > 0, 1, 0)
dataframe['e'] = np.where(dataframe['b'] > 0, 1, 0)
dataframe['object_id'] = np.random.choice(dataframe.index,
                                          size=dataframe.shape[0],
                                          replace=False)
dataframe.head()
[3]:
m a b l e object_id
0 0.0 0.496714 1.561841 1 1 63869
1 0.0 -0.138264 -0.094228 0 0 82374
2 0.0 0.647689 -1.329536 1 0 162918
3 0.0 1.523030 -1.388638 1 0 36327
4 0.0 -0.234153 -0.342651 0 0 91526
[4]:
dataframe.shape
[4]:
(200000, 6)

Now let’s get acquainted with the Splitter class.

The Splitter class is Ambrosia’s main tool for splitting objects into groups. It has one main public method, run(), which returns a table with groups of the desired size.

Let’s create an instance of the class and pass the generated dataframe to the constructor. This data plays the role of an abstract user database and will be used below to create groups with different methods. We also set id_column to "object_id", the column that contains the unique identifiers of the objects. If this column is not specified, the dataframe index is used as the identifiers.

[5]:
splitter = Splitter(dataframe=dataframe, id_column='object_id')

As with the Designer class, we can instead pass the dataframe and other parameters later as arguments to the run() method. The same holds for most of the parameters related directly to the experiment (errors, effects, and so on): either pass them to the constructor during initialization (they then become attributes of the created instance), or pass them later when calling run(). If a parameter is set in both places, the argument passed to the method takes precedence over the attribute value.

Now let’s review the different ways of creating groups that are implemented in the Splitter class.
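
The precedence rule can be illustrated with a minimal sketch. The class `TinySplitter` below is a hypothetical stand-in written for this tutorial, not Ambrosia code; it only shows the idea that a method argument, when given, overrides the value stored on the instance.

```python
class TinySplitter:
    """Hypothetical stand-in illustrating argument-over-attribute precedence."""

    def __init__(self, groups_size=None):
        self.groups_size = groups_size

    def run(self, groups_size=None):
        # The method argument wins; the attribute is only a fallback.
        chosen = groups_size if groups_size is not None else self.groups_size
        if chosen is None:
            raise ValueError("groups_size must be set in the constructor or in run()")
        return chosen


tiny = TinySplitter(groups_size=2000)
from_attribute = tiny.run()                # falls back to the attribute
from_argument = tiny.run(groups_size=500)  # the argument takes precedence
```

Here `from_attribute` uses the value given at construction, while `from_argument` uses the value passed to the call.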

Split approaches

Simple split

The first splitting strategy is called "simple". It is a very simple, non-deterministic way of creating groups: a new result is produced each time it is executed.

To create such a split, we call the run() method with the corresponding value of the method parameter. We will create groups of 2000 objects each.

[6]:
splitter.run(method='simple', groups_size=2000)
[6]:
m a b l e object_id group
191060 0.0 -0.230298 1.253592 0 1 136859 A
121593 0.0 1.974664 -1.780258 1 0 164797 A
185512 0.0 -1.254767 -0.152099 0 0 49954 A
79803 0.0 -1.572960 -0.706893 0 0 154922 A
98956 0.0 0.714251 0.662607 1 1 99718 A
... ... ... ... ... ... ... ...
53739 0.0 0.070655 0.644952 1 1 62827 B
178405 0.0 -0.423988 -0.706336 0 0 103080 B
95002 0.0 -0.105022 0.714893 0 1 155745 B
166811 0.0 -1.459109 0.339358 0 1 157092 B
41369 0.0 0.721402 -0.980647 1 0 113 B

4000 rows × 7 columns
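
Conceptually, a split of this kind can be thought of as random sampling without replacement followed by cutting the sample into equal parts. The sketch below uses only the standard library; the function `simple_split` is a hypothetical illustration and is not claimed to match Ambrosia's internal implementation.

```python
import random


def simple_split(object_ids, groups_size, groups_number=2, rng=None):
    """Draw groups_number disjoint random groups of groups_size identifiers each."""
    rng = rng or random.Random()  # unseeded, so every call gives a new result
    sampled = rng.sample(list(object_ids), groups_size * groups_number)
    labels = [chr(ord("A") + i) for i in range(groups_number)]
    return {
        label: sampled[i * groups_size:(i + 1) * groups_size]
        for i, label in enumerate(labels)
    }


first = simple_split(range(200_000), groups_size=2000)
second = simple_split(range(200_000), groups_size=2000)  # a different random draw
```

Because the generator is not seeded, `first` and `second` will almost surely differ, which mirrors the non-deterministic behavior described above.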

Hash split

The hash split strategy is based on hashing object identifiers and distributing the resulting hash values into the appropriate groups.
This method gives a deterministic split of objects into groups. There is also no need to store tables with the assigned group labels, because the labels can be restored at any time by re-running the split.

To make the splits for each experiment unique, the "salt" parameter is used, which is appended to the end of the identifier of each object. The salt value can be, for example, the name of the experiment being performed.

You can read more about hash-based splitting on the web.
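
The core idea can be sketched in a few lines. The hash function (SHA-256), the modulo rule, and the name `hash_group` are assumptions made for illustration; the source does not state which hash Ambrosia uses or how it turns hash values into groups of a fixed size.

```python
import hashlib


def hash_group(object_id, salt, groups_number=2):
    """Return a stable group label derived from a salted SHA-256 digest."""
    key = f"{object_id}{salt}".encode("utf-8")  # salt is appended to the identifier
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return chr(ord("A") + digest % groups_number)


ids = [63869, 82374, 162918, 36327, 91526]
salt = "example_dummy_experiment_2023"
labels_first = [hash_group(i, salt) for i in ids]
labels_again = [hash_group(i, salt) for i in ids]    # identical to labels_first
labels_other = [hash_group(i, "salt") for i in ids]  # generally a different assignment
```

Note that this sketch labels every object, whereas run() additionally selects groups of the requested size; that selection step is not covered here.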

Let’s create a hash split and make sure the result is deterministic

[7]:
groups_size = 5000
salt = 'example_dummy_experiment_2023'

Execute split with pre-defined salt value

[8]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)
[8]:
m a b l e object_id group
14 0.0 -1.724918 -0.350186 0 0 90837 A
44 0.0 -1.478522 0.166608 0 1 123196 A
64 0.0 0.812526 0.914659 1 1 117133 A
65 0.0 1.356240 0.731410 1 1 144787 A
161 0.0 0.787085 -1.012367 1 0 186437 A
... ... ... ... ... ... ... ...
199760 0.0 0.172396 0.844596 1 1 166816 B
199783 0.0 -0.477993 -0.899310 0 0 134168 B
199867 0.0 -1.164759 -0.649031 0 0 41423 B
199868 0.0 0.162848 2.835048 1 1 33513 B
199915 0.0 0.882166 -1.665376 1 0 33638 B

10000 rows × 7 columns

Then get the same groups for the same salt value

[9]:
splitter.run(method='hash', groups_size=groups_size, salt=salt)
[9]:
m a b l e object_id group
14 0.0 -1.724918 -0.350186 0 0 90837 A
44 0.0 -1.478522 0.166608 0 1 123196 A
64 0.0 0.812526 0.914659 1 1 117133 A
65 0.0 1.356240 0.731410 1 1 144787 A
161 0.0 0.787085 -1.012367 1 0 186437 A
... ... ... ... ... ... ... ...
199760 0.0 0.172396 0.844596 1 1 166816 B
199783 0.0 -0.477993 -0.899310 0 0 134168 B
199867 0.0 -1.164759 -0.649031 0 0 41423 B
199868 0.0 0.162848 2.835048 1 1 33513 B
199915 0.0 0.882166 -1.665376 1 0 33638 B

10000 rows × 7 columns

The split result will be different if the salt is changed

[10]:
splitter.run(method='hash', groups_size=groups_size, salt='salt')
[10]:
m a b l e object_id group
43 0.0 -0.301104 0.440295 0 1 139147 A
192 0.0 0.214094 0.021427 1 1 231 A
226 0.0 0.064280 1.553626 1 1 139761 A
235 0.0 0.633919 -1.277988 1 0 153281 A
285 0.0 -1.952088 1.610653 0 1 36040 A
... ... ... ... ... ... ... ...
199862 0.0 2.035899 0.452816 1 1 34064 B
199949 0.0 0.438721 -0.592572 1 0 99013 B
199970 0.0 0.868163 0.463027 1 1 53783 B
199991 0.0 0.383196 0.230814 1 1 199822 B
199996 0.0 0.565654 -2.316381 1 0 147356 B

10000 rows × 7 columns

If no salt argument is passed, a random value will be generated during the split.

The hash splitting method is fast and convenient, and it is the recommended default.

Metric split

For some tasks, it is very useful to find similar objects and distribute them into groups. For example, we can choose a random object in group A and from the general pool find the closest neighbor to it by some metric and send it to group B. This will make the groups more similar and increase the power of some statistical tests, which is especially valuable for small groups.

This approach is implemented in the "metric" split method. Using the fit_columns parameter, we specify a set of features; pairs of similar objects are then selected by minimizing the Euclidean distance over these features and distributed between the groups.

We will create two groups using a metric split based on the two features a and b. Metric split is computationally demanding: it has to find nearest neighbors for a set of points equal in size to one group.
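
To make the pairing idea concrete, here is a brute-force sketch in plain Python. It is a greedy illustration under our own assumptions (the function name `metric_pairs`, random anchors, exhaustive search), not the algorithm Ambrosia actually runs, which would need a faster neighbor search for 200000 objects.

```python
import math
import random


def metric_pairs(points, groups_size, seed=0):
    """Pick random anchors for A and give each its nearest unused neighbor in B."""
    rng = random.Random(seed)
    unused = set(range(len(points)))
    group_a, group_b = [], []
    for _ in range(groups_size):
        anchor = rng.choice(sorted(unused))
        unused.remove(anchor)
        # Exhaustive Euclidean nearest-neighbor search among the remaining points.
        nearest = min(unused, key=lambda j: math.dist(points[anchor], points[j]))
        unused.remove(nearest)
        group_a.append(anchor)
        group_b.append(nearest)
    return group_a, group_b


rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
group_a, group_b = metric_pairs(points, groups_size=20)
```

The objects at position i of `group_a` and `group_b` form a pair, so the two lists stay aligned by construction.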

[11]:
metric_split = splitter.run(method='metric', groups_size=groups_size, fit_columns=['a', 'b'])
[12]:
metric_split
[12]:
m a b l e object_id group
199994 0.0 -0.590488 -0.518154 0 0 12123 A
80866 0.0 0.436653 -0.458537 1 0 11556 A
128000 0.0 0.448011 -0.555275 1 0 71871 A
95833 0.0 0.514975 1.088812 1 1 149913 A
41929 0.0 -1.537990 -0.270142 0 0 54406 A
... ... ... ... ... ... ... ...
191916 0.0 -1.306423 -0.777014 0 0 13089 B
57853 0.0 -0.490125 1.742080 0 1 159390 B
189321 0.0 -1.759917 0.181625 0 1 153730 B
92099 0.0 -0.972475 0.624865 0 1 13100 B
137630 0.0 -0.175013 0.633727 0 1 23283 B

10000 rows × 7 columns

Currently, paired objects occupy the same positions in their group slices, and this is the only way to find them if you want to inspect pairs individually.

[13]:
metric_split.query("group == 'A'").iloc[0]
[13]:
m                 0.0
a           -0.590488
b           -0.518154
l                   0
e                   0
object_id       12123
group               A
Name: 199994, dtype: object
[14]:
metric_split.query("group == 'B'").iloc[0]
[14]:
m                 0.0
a            -0.59219
b           -0.519087
l                   0
e                   0
object_id      145596
group               B
Name: 111639, dtype: object

Note: Metric split creates pairs (or sets in the case of multiple groups) of dependent objects between groups. As a result, paired statistical tests must be used.
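
As a reminder of what "paired" means here, the sketch below computes the t statistic of a paired t-test from pairwise differences, using only the standard library. The metric values are made up for illustration, and `paired_t_statistic` is a hypothetical helper, not an Ambrosia function.

```python
import math
import statistics


def paired_t_statistic(values_a, values_b):
    """Mean of the pairwise differences divided by its standard error."""
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Illustrative metric values; position i in both lists belongs to the same pair.
metric_a = [10.1, 9.8, 11.2, 10.5, 9.9]
metric_b = [10.4, 10.0, 11.1, 10.9, 10.3]
t_stat = paired_t_statistic(metric_a, metric_b)
```

The statistic is built from within-pair differences, so it relies on the positional alignment of pairs described above.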

Stratification

We can also sample groups with stratification.

The stratification technique makes groups more homogeneous and more similar to the general population from which they were sampled, and it reduces the variance of metrics in the groups. This may be especially useful in the case of small groups.

To demonstrate, let’s choose a binary column for stratification, pass it to the strat_columns parameter, and compare the feature shares with and without stratification.
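
A minimal sketch of proportional stratified sampling follows. It is written with plain Python records rather than a dataframe, and the function `stratified_sample` is a hypothetical illustration of the technique, not Ambrosia's implementation.

```python
import random
from collections import defaultdict


def stratified_sample(records, strat_key, sample_size, seed=0):
    """Sample so that each stratum keeps roughly its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[strat_key]].append(record)
    sample = []
    for members in strata.values():
        # Quota proportional to the stratum size; rounding may shift the total by one.
        quota = round(sample_size * len(members) / len(records))
        sample.extend(rng.sample(members, quota))
    return sample


rng = random.Random(42)
population = [{"object_id": i, "l": int(rng.random() < 0.3)} for i in range(10_000)]
sample = stratified_sample(population, "l", sample_size=500)
population_share = sum(r["l"] for r in population) / len(population)
sample_share = sum(r["l"] for r in sample) / len(sample)
```

Because each stratum contributes its proportional quota, `sample_share` stays close to `population_share` even for a small sample.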

[15]:
groups_size = 500
[16]:
stratified_split = splitter.run(method='simple', groups_size=groups_size, strat_columns=['l'])
non_stratified_split = splitter.run(method='simple', groups_size=groups_size)
[17]:
print(f'Initial share of strata: {dataframe["l"].mean() * 100:.1f}%')
Initial share of strata: 50.1%

Share of strata inside the splits

[18]:
print(
    f'Share of strata in groups with stratification: {stratified_split["l"].mean() * 100:.1f}%'
)
print(
    f'Share of strata in groups without stratification: {non_stratified_split["l"].mean() * 100:.1f}%'
)
Share of strata in groups with stratification: 50.1%
Share of strata in groups without stratification: 51.3%

Share of strata inside the groups

[19]:
print('Share of strata in each group with stratification\n',
      np.round(stratified_split.groupby('group')['l'].mean(), 3))
print('\n\nShare of strata in each group without stratification\n',
      np.round(non_stratified_split.groupby('group')['l'].mean(), 3))
Share of strata in each group with stratification
 group
A    0.500
B    0.502
Name: l, dtype: float64


Share of strata in each group without stratification
 group
A    0.514
B    0.512
Name: l, dtype: float64

Multigroup split

Often two experimental groups are not enough, for example when we want to test the performance of several new recommender system algorithms. In that scenario one may want to make an A/B/C/... split.

In Ambrosia, all functions and methods above are generalized to splits into several groups, and the number of groups is controlled by the groups_number parameter.

Let’s create 3 groups using metric split

[20]:
metric_multisplit = splitter.run(method='metric',
                                 groups_size=groups_size,
                                 fit_columns=['a', 'b'],
                                 groups_number=3)
[21]:
metric_multisplit.query("group == 'A'").iloc[0]
[21]:
m                 0.0
a            0.612142
b            0.115751
l                   1
e                   1
object_id      182967
group               A
Name: 74537, dtype: object
[22]:
metric_multisplit.query("group == 'B'").iloc[0]
[22]:
m                 0.0
a            0.614548
b            0.116243
l                   1
e                   1
object_id      197300
group               B
Name: 118068, dtype: object
[23]:
metric_multisplit.query("group == 'C'").iloc[0]
[23]:
m                 0.0
a            0.613116
b            0.115762
l                   1
e                   1
object_id       77456
group               C
Name: 150076, dtype: object

And now create 10 groups using the hash method

[24]:
hash_multisplit = splitter.run(method='hash',
                               groups_size=1000,
                               groups_number=10)
[25]:
hash_multisplit
[25]:
m a b l e object_id group
16 0.0 -1.012831 0.747465 0 1 155174 A
245 0.0 -0.334501 0.615637 0 1 199630 A
710 0.0 0.211017 1.701546 1 1 156984 A
920 0.0 1.073632 0.596244 1 1 78818 A
1159 0.0 -0.324831 0.547028 0 1 40862 A
... ... ... ... ... ... ... ...
198886 0.0 2.243574 0.782718 1 1 134662 J
199060 0.0 1.154370 0.258987 1 1 195179 J
199812 0.0 -2.566508 0.553087 0 1 165028 J
199946 0.0 0.934797 -1.368926 1 0 25702 J
199960 0.0 0.730843 -0.735387 1 0 35482 J

10000 rows × 7 columns

[26]:
hash_multisplit.group.value_counts()
[26]:
A    1000
B    1000
C    1000
D    1000
E    1000
F    1000
G    1000
H    1000
I    1000
J    1000
Name: group, dtype: int64

Splitting the full table

Sometimes one needs to divide an entire table into groups. At the moment, Ambrosia allows splitting a data frame into 2 groups using the part_of_table parameter.

We will split the passed data frame with the hash method so that group A receives 1/3 of the table and group B the remaining 2/3

[27]:
part_of_table = 1/3
fractional_hash_split = splitter.run(method='hash',
                                     part_of_table=part_of_table,
                                     salt='fractional_split')
[28]:
fractional_hash_split
[28]:
m a b l e object_id group
1 0.0 -0.138264 -0.094228 0 0 82374 A
3 0.0 1.523030 -1.388638 1 0 36327 A
5 0.0 -0.234137 -1.580520 0 0 63304 A
6 0.0 1.579213 0.587148 1 1 187546 A
15 0.0 -0.562288 -1.362157 0 0 133839 A
... ... ... ... ... ... ... ...
199991 0.0 0.383196 0.230814 1 1 199822 B
199994 0.0 -0.590488 -0.518154 0 0 12123 B
199996 0.0 0.565654 -2.316381 1 0 147356 B
199998 0.0 0.855673 0.462531 1 1 132270 B
199999 0.0 -1.064948 -0.137357 0 0 86886 B

200000 rows × 7 columns

[29]:
fractional_hash_split.group.value_counts(normalize=True)
[29]:
B    0.666665
A    0.333335
Name: group, dtype: float64
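
One common way to obtain such a fractional split from hashes is to scale the hash value to the interval [0, 1) and compare it with the desired share. The sketch below assumes SHA-256 and a hypothetical helper `fractional_group`; how Ambrosia implements part_of_table internally is not stated in this tutorial.

```python
import hashlib


def fractional_group(object_id, salt, part_of_table):
    """Group A if the salted hash, scaled to [0, 1), falls below part_of_table."""
    digest = hashlib.sha256(f"{object_id}{salt}".encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 16**8  # first 32 bits as a fraction
    return "A" if position < part_of_table else "B"


labels = [fractional_group(i, "fractional_split", 1 / 3) for i in range(30_000)]
share_a = labels.count("A") / len(labels)  # expected to be close to 1/3
```

Every object receives a label, so the two groups together cover the whole table, and the assignment is reproducible for a fixed salt.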

Selecting a control group for an existing test group

Another scenario that sometimes occurs in A/B testing is post-generation of a control group from the pool of available objects that were not affected by the treatment.

Although you have to be quite careful with post-analysis of experiments and post-generation of samples, Ambrosia allows you to create a control group for an existing test group using all of the methods above.

To do this, pass a list of identifiers from the test group to the test_group_ids parameter.

[30]:
np.random.seed(42)
groups_size = 10000
test_ids = np.random.choice(dataframe.object_id, size=groups_size, replace=False)
[31]:
post_hash_split = splitter.run(method='hash',
                               groups_size=groups_size,
                               test_group_ids=test_ids,
                               salt='post-split')
[32]:
post_hash_split
[32]:
m a b l e object_id group
24 0.0 -0.544383 0.242347 0 1 37916 A
94 0.0 -0.392108 0.026810 0 1 12345 A
136 0.0 -0.783253 -1.911507 0 0 94132 A
152 0.0 -0.680025 -0.649503 0 0 87931 A
155 0.0 -0.714351 0.266708 0 1 122498 A
... ... ... ... ... ... ... ...
199945 0.0 2.866851 -0.048362 1 0 146867 B
199963 0.0 -0.935605 0.169802 0 1 116839 B
199976 0.0 -0.811208 -0.931313 0 0 18652 B
199977 0.0 0.547009 0.330221 1 1 180593 B
199997 0.0 0.160020 0.831556 1 1 12296 B

20000 rows × 7 columns

Check that all objects with test ids are in group B and none of them are in group A

[33]:
np.isin(test_ids, post_hash_split.query("group == 'A'").object_id).sum()
[33]:
0
[34]:
np.isin(test_ids, post_hash_split.query("group == 'B'").object_id).sum()
[34]:
10000
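
The logic behind this scenario can be sketched as follows: exclude the test identifiers from the pool, then sample a control group of the same size from what remains. The helper `pick_control` is hypothetical and uses simple random sampling; it is meant as an illustration rather than a description of what run() does with test_group_ids.

```python
import random


def pick_control(all_ids, test_ids, seed=0):
    """Sample a control group, equal in size to the test group, from untreated objects."""
    test_set = set(test_ids)
    candidates = [i for i in all_ids if i not in test_set]
    rng = random.Random(seed)
    return rng.sample(candidates, len(test_set))


all_ids = list(range(1000))
test_group = all_ids[:100]
control_group = pick_control(all_ids, test_group)
```

By construction, no identifier appears in both `test_group` and `control_group`, which is the same property the two checks above verify.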

Storable configuration

Sometimes it is convenient to save the created class instance to a file, so that it can later be loaded and reused with preselected attributes. Attributes such as datasets are not serialized and must be set after the instance is loaded.
The load_from_config function restores an instance directly from a .yaml file.
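
The idea of storing only lightweight settings can be sketched with a small class. To stay within the standard library, this example serializes to JSON instead of YAML, and the class `TinyConfigurable` is a hypothetical stand-in rather than the Splitter class itself.

```python
import json


class TinyConfigurable:
    """Hypothetical class that keeps its dataset out of the serialized state."""

    def __init__(self, dataframe=None, id_column=None, groups_size=None):
        self.dataframe = dataframe
        self.id_column = id_column
        self.groups_size = groups_size

    def __getstate__(self):
        # Only lightweight settings are stored; the dataset is left out.
        return {"id_column": self.id_column, "groups_size": self.groups_size}


original = TinyConfigurable(dataframe=[1, 2, 3], id_column="object_id", groups_size=322)
config_text = json.dumps(original.__getstate__(), indent=2)  # JSON stands in for YAML
restored = TinyConfigurable(**json.loads(config_text))
```

After loading, `restored.dataframe` is empty and has to be set again before any split can be made.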

Let’s create an instance with the parameters we want to save in a file

[35]:
store_path = '_examples_configs/splitter_config.yaml'
[36]:
storable_splitter = Splitter(id_column='object_id',
                             groups_size=322,
                             strat_columns=['l', 'e'])
[37]:
storable_splitter.__getstate__()
[37]:
{'id_column': 'object_id',
 'groups_size': 322,
 'fit_columns': None,
 'strat_columns': ['l', 'e']}

Save as the .yaml file

[38]:
with open(store_path, "w") as outfile:
    yaml.dump(storable_splitter, outfile, default_flow_style=False)

Load from the file

[39]:
loaded_splitter = load_from_config(store_path)
[40]:
loaded_splitter.__getstate__()
[40]:
{'id_column': 'object_id',
 'groups_size': 322,
 'fit_columns': None,
 'strat_columns': ['l', 'e']}

Set the dataframe and make a split

[41]:
loaded_splitter.set_dataframe(dataframe)
[42]:
loaded_splitter.run(method='hash', salt='from_yaml')
[42]:
m a b l e object_id group
2198 0.0 -0.641487 -0.658645 0 0 20742 A
2642 0.0 -0.847634 -0.043161 0 0 120722 A
7596 0.0 -1.572989 -1.090749 0 0 24930 A
7688 0.0 -0.233898 -1.196993 0 0 5767 A
10737 0.0 -1.000271 -0.298924 0 0 187434 A
... ... ... ... ... ... ... ...
197562 0.0 2.832918 0.327783 1 1 41784 B
132857 0.0 0.286866 -0.218729 1 0 11805 A
36947 0.0 -0.893407 -0.847359 0 0 175180 A
27596 0.0 1.459311 1.463153 1 1 65061 B
162879 0.0 0.702267 -0.490351 1 0 35838 B

644 rows × 7 columns

Stand-alone split function

Ambrosia also contains the split function, which replicates the behavior of the Splitter class and can be used for the same split tasks

[43]:
split(method='simple',
      dataframe=dataframe,
      id_column='object_id',
      groups_size=1000)
[43]:
m a b l e object_id group
196900 0.0 0.050390 0.819410 1 1 14646 A
43323 0.0 -3.393915 -0.281103 0 0 36721 A
171194 0.0 -0.177053 0.248580 0 1 156469 A
185518 0.0 0.757116 0.366774 1 1 53901 A
10289 0.0 -0.751167 2.058094 0 1 134394 A
... ... ... ... ... ... ... ...
69994 0.0 -1.329803 -0.023193 0 0 24351 B
114139 0.0 0.714640 1.330547 1 1 99421 B
142263 0.0 -0.590689 -0.951325 0 0 97706 B
86716 0.0 -0.601845 0.501994 0 1 83325 B
108454 0.0 0.835601 0.454933 1 1 18123 B

2000 rows × 7 columns


Learn more

More information about splitting groups with Ambrosia is available

Check:

  • Splitter class documentation

  • An example of splitting groups from a Spark DataFrame (currently has limited functionality)