Chain `Preprocessor` class overview

In this tutorial, we look at the sequential Preprocessor, whose methods wrap most of the data processing classes implemented in Ambrosia.

To demonstrate the capabilities of the class, we will use synthetic data on the time spent by users on video and audio content.

[2]:

import numpy as np
import pandas as pd

from ambrosia.preprocessing import Preprocessor

Load data

[3]:

data = pd.read_csv('../tests/test_data/pipeline_test.csv')

This is daily per-user data covering a period of one week.

[4]:

data.head()

[4]:

	id	gender	watched	audio	day	platform
0	0	Male	7.912889	2.210973	1	web
1	1	Male	6.678690	0.020715	1	ios
2	2	Female	721.434299	59.996870	1	ios
3	3	Male	135.248218	18.982887	1	ios
4	4	Female	38.962917	8.324667	1	android

The Preprocessor class allows one to create custom sequential pipelines that include the steps of data aggregation, outlier removal, and metric transformation. These pipelines can be saved and loaded from files, making them suitable for ongoing data processing.

Let’s create a class instance and pass data to it

[5]:

preprocessor = Preprocessor(dataframe=data, verbose=True)

Now we will apply a number of preprocessing steps: aggregation, outlier removal, and the CUPED metric transformation for variance reduction.

For almost all of the individual data processing classes in Ambrosia, the Preprocessor class has a corresponding method. Check the class documentation to find out their aliases and capabilities.

[6]:

### Set detailed aggregation parameters
agg_params = {
    'watched': 'sum',
    'audio': 'sum',
    'gender': 'simple',  # simple - choose the first possible value
    'platform': 'mode'
}
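For intuition, the aggregation that `agg_params` describes can be approximated in plain pandas. This is only a sketch, not Ambrosia's implementation: the toy frame is invented for illustration, and mapping 'simple' to "first value" and 'mode' to "most frequent value" is an assumption based on the comment in the cell above.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the tutorial dataset.
toy = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "watched": [1.0, 2.0, 3.0, 10.0, 20.0],
    "audio": [0.5, 0.5, 1.0, 2.0, 3.0],
    "platform": ["web", "ios", "ios", "android", "android"],
})


def most_frequent(series: pd.Series):
    """Return the most frequent value; ties resolve to the first in sorted order."""
    return series.mode().iloc[0]


# One row per user: sums for metrics, first value for gender, mode for platform.
aggregated = toy.groupby("id", as_index=False).agg(
    watched=("watched", "sum"),
    audio=("audio", "sum"),
    gender=("gender", "first"),     # assumed equivalent of 'simple'
    platform=("platform", most_frequent),  # assumed equivalent of 'mode'
)
print(aggregated)
```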

[7]:

processed_data = (
    preprocessor.aggregate(groupby_columns='id', agg_params=agg_params)
    .robust(['watched', 'audio'], alpha=0.01, tail='right')
    .cuped('watched', by='audio', transformed_name='watched_cuped')
    .data()
)

ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
                 with alphas = [0.01 0.01]
ambrosia LOGGER:

ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:

ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
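The first part of the log reports a right-tail robust transformation with alpha = 0.01. A minimal sketch of that idea follows, assuming it means dropping every row whose value lies above the (1 - alpha) quantile in any of the listed columns. The helper `trim_right_tail` and the synthetic data are hypothetical and are not part of Ambrosia.

```python
import numpy as np
import pandas as pd


def trim_right_tail(df: pd.DataFrame, columns, alpha: float = 0.01) -> pd.DataFrame:
    """Drop rows whose value exceeds the (1 - alpha) quantile in any given column."""
    thresholds = df[columns].quantile(1 - alpha)    # one threshold per column
    keep = (df[columns] <= thresholds).all(axis=1)  # row must pass in every column
    return df[keep]


# Hypothetical heavy-tailed data standing in for the aggregated metrics.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "watched": rng.lognormal(mean=8.0, sigma=1.0, size=1000),
    "audio": rng.lognormal(mean=5.0, sigma=0.5, size=1000),
})

trimmed = trim_right_tail(toy, ["watched", "audio"], alpha=0.01)
print("rows:", len(toy), "->", len(trimmed))
print("watched variance:", toy["watched"].var(), "->", trimmed["watched"].var())
```

Because the thresholds are computed per column and a row is dropped if it fails in either one, the share of removed rows can be anywhere between alpha and alpha times the number of columns.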

Note that the final data() method returns the resulting data frame.

[8]:

processed_data.head()

[8]:

	id	watched	audio	gender	platform	watched_cuped
0	0	2489.224016	213.817130	Male	web	5476.097797
1	1	3970.775664	281.958297	Male	ios	5402.751034
2	2	5900.186483	416.944150	Female	ios	4251.949148
3	3	5557.860998	384.782010	Male	web	4643.524511
4	4	7588.374990	448.263748	Female	android	5225.462582
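The watched_cuped column comes from the CUPED step. As a rough illustration, the standard CUPED adjustment subtracts from the metric the part that is linearly explained by a covariate: Y_cuped = Y - theta * (X - mean(X)), with theta = cov(Y, X) / var(X). The sketch below applies this textbook formula to synthetic data; the function name and the data are hypothetical, and Ambrosia's own implementation may differ in detail.

```python
import numpy as np


def cuped(metric: np.ndarray, covariate: np.ndarray) -> np.ndarray:
    """Return metric - theta * (covariate - mean(covariate)), theta = cov / var."""
    theta = np.cov(metric, covariate, ddof=1)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())


# Hypothetical correlated metrics: 'watched' depends linearly on 'audio' plus noise.
rng = np.random.default_rng(42)
audio = rng.normal(loc=350.0, scale=120.0, size=5000)
watched = 20.0 * audio + rng.normal(loc=0.0, scale=800.0, size=5000)

watched_cuped = cuped(watched, audio)
ratio = watched_cuped.var(ddof=1) / watched.var(ddof=1)
print(f"variance after CUPED is {100 * ratio:.2f} % of the original")
```

The adjustment keeps the sample mean unchanged and leaves a fraction 1 - corr(Y, X)^2 of the original variance, which is why a strongly correlated covariate gives a large reduction.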

The transformations() method returns a list of all applied transformations. The parameters of these transformations were fitted when the corresponding methods were executed.

[9]:

preprocessor.transformations()

[9]:

[<ambrosia.preprocessing.aggregate.AggregatePreprocessor at 0x12f6a67f0>,
 <ambrosia.preprocessing.robust.RobustPreprocessor at 0x10433abe0>,
 <ambrosia.preprocessing.cuped.Cuped at 0x12f504820>]
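The chaining style and the recorded list of fitted steps can be pictured with a toy class. Everything in this sketch (`ToyChain`, its methods, and the step dictionaries) is hypothetical and only illustrates the pattern of returning `self` from each step; it is not how the Ambrosia Preprocessor is implemented.

```python
import pandas as pd


class ToyChain:
    """Hypothetical chainable wrapper that records each fitted step."""

    def __init__(self, dataframe: pd.DataFrame):
        self._df = dataframe.copy()
        self._steps = []

    def drop_above_quantile(self, column: str, quantile: float) -> "ToyChain":
        # The threshold is fitted on the current data at call time.
        threshold = float(self._df[column].quantile(quantile))
        self._df = self._df[self._df[column] <= threshold]
        self._steps.append({"step": "drop_above_quantile", "column": column,
                            "threshold": threshold})
        return self  # returning self is what enables chaining

    def center(self, column: str) -> "ToyChain":
        mean = float(self._df[column].mean())
        self._df = self._df.assign(**{column + "_centered": self._df[column] - mean})
        self._steps.append({"step": "center", "column": column, "mean": mean})
        return self

    def data(self) -> pd.DataFrame:
        return self._df

    def transformations(self) -> list:
        return list(self._steps)


toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
chain = ToyChain(toy)
result = chain.drop_above_quantile("x", 0.8).center("x").data()
print(result)
print(chain.transformations())
```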

For many scenarios, it is useful to store executed transformations with fitted parameters for future use.

For example, we may have some continuous batch data that we would like to transform, or we are waiting for some A/B test to finish and we need to process the data with the same pre-experimental parameters.

For this purpose, the Preprocessor has two methods for saving and loading fitted transformations: store_transformations() and transform_from_config().

First, let’s store them

[10]:

store_path = '_examples_configs/preprocessor.json'

[11]:

preprocessor.store_transformations(store_path=store_path)

Now imagine that in the future we would like to process the data using these stored transformations.

For simplicity, we will use the same data

Create a new instance with the data to process

[12]:

future_preprocessor = Preprocessor(dataframe=data)

Pass the path to the stored transformations

[13]:

future_preprocessor.transform_from_config(load_path=store_path)

ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
                 with alphas = [0.01 0.01]
ambrosia LOGGER:

ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:

ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668

[13]:

	id	watched	audio	gender	platform	watched_cuped
0	0	2489.224016	213.817130	Male	web	5476.097797
1	1	3970.775664	281.958297	Male	ios	5402.751034
2	2	5900.186483	416.944150	Female	ios	4251.949148
3	3	5557.860998	384.782010	Male	web	4643.524511
4	4	7588.374990	448.263748	Female	android	5225.462582
...	...	...	...	...	...	...
4995	4995	1647.603060	167.552826	Male	web	5690.171914
4996	4996	7403.347846	423.972130	Female	android	5594.740581
4997	4997	3243.170373	287.159499	Male	android	4556.460653
4998	4998	12538.349029	615.502371	Female	ios	6359.254994
4999	4999	2302.644537	213.418724	Female	android	5298.609479

4931 rows × 6 columns
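The fit, store, and reapply pattern shown above can also be sketched without Ambrosia. The example below is an illustration under assumptions: the helper functions, the parameter names, and the JSON layout are hypothetical and do not reflect the format written by store_transformations(). The point is that parameters are fitted once, serialized, and later applied without refitting.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd


def fit_params(df, metric, covariate, alpha=0.01):
    """Fit a right-tail threshold and CUPED coefficients on reference data."""
    threshold = float(df[metric].quantile(1 - alpha))
    kept = df[df[metric] <= threshold]
    theta = float(kept[metric].cov(kept[covariate]) / kept[covariate].var())
    return {
        "threshold": threshold,
        "theta": theta,
        "covariate_mean": float(kept[covariate].mean()),
    }


def apply_params(df, params, metric, covariate):
    """Apply previously fitted parameters without refitting anything."""
    out = df[df[metric] <= params["threshold"]].copy()
    shift = params["theta"] * (out[covariate] - params["covariate_mean"])
    out[metric + "_cuped"] = out[metric] - shift
    return out


# Hypothetical reference data.
rng = np.random.default_rng(1)
audio = rng.normal(350.0, 120.0, size=2000)
reference = pd.DataFrame({
    "audio": audio,
    "watched": 20.0 * audio + rng.normal(0.0, 800.0, size=2000),
})

params = fit_params(reference, "watched", "audio")

# Round-trip the fitted parameters through a JSON file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "params.json")
    with open(path, "w") as f:
        json.dump(params, f)
    with open(path) as f:
        loaded = json.load(f)

direct = apply_params(reference, params, "watched", "audio")
restored = apply_params(reference, loaded, "watched", "audio")
print(direct.equals(restored))
```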

Learn more

To learn more about the transformations available in the Preprocessor, their functionality, and their usage, check:

Preprocessor class documentation
An overview of Ambrosia main data preprocessing tools
An overview of advanced metric transformation to learn about different methods for reducing variance
