Chain Preprocessor class overview

In this tutorial, we will look at the sequential Preprocessor class, whose methods wrap most of the data processing classes implemented in Ambrosia.

To demonstrate the capabilities of the class, we will use synthetic data on the time spent by users on video and audio content.

[2]:
import numpy as np
import pandas as pd

from ambrosia.preprocessing import Preprocessor

Load data

[3]:
data = pd.read_csv('../tests/test_data/pipeline_test.csv')

This is daily data for users over a period of one week.

[4]:
data.head()
[4]:
id gender watched audio day platform
0 0 Male 7.912889 2.210973 1 web
1 1 Male 6.678690 0.020715 1 ios
2 2 Female 721.434299 59.996870 1 ios
3 3 Male 135.248218 18.982887 1 ios
4 4 Female 38.962917 8.324667 1 android

The Preprocessor class lets you build custom sequential pipelines that include data aggregation, outlier removal, and metric transformation steps. These pipelines can be saved to and loaded from files, which makes them suitable for ongoing data processing.

Let’s create a class instance and pass data to it

[5]:
preprocessor = Preprocessor(dataframe=data, verbose=True)

Now we will apply a number of preprocessing steps: aggregation, outlier removal, and the CUPED metric transformation for variance reduction.

For almost all of the individual data processing classes in Ambrosia, the Preprocessor class has a corresponding method. Check the class documentation to find out their aliases and capabilities.

[6]:
### Set detailed aggregation parameters
agg_params = {
    'watched': 'sum',
    'audio': 'sum',
    'gender': 'simple',  # simple - choose the first possible value
    'platform': 'mode'
}
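
As a rough illustration of what these aggregation rules mean, the sketch below reproduces them with plain pandas on a tiny hand-made frame. It is not Ambrosia's implementation: it assumes that 'simple' keeps the first value in each group (as the comment above says) and that 'mode' keeps the most frequent value. The toy data and the helper `most_frequent` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two users with several daily rows each
toy = pd.DataFrame({
    'id': [0, 0, 0, 1, 1],
    'watched': [1.0, 2.0, 3.0, 10.0, 20.0],
    'audio': [0.5, 0.5, 1.0, 2.0, 3.0],
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'platform': ['web', 'ios', 'web', 'android', 'android'],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value (the smallest one in sort order on ties)."""
    return values.mode().iloc[0]


# Assumed pandas counterparts of the 'sum', 'simple' and 'mode' rules
toy_aggregated = toy.groupby('id', as_index=False).agg(
    watched=('watched', 'sum'),
    audio=('audio', 'sum'),
    gender=('gender', 'first'),
    platform=('platform', most_frequent),
)
print(toy_aggregated)
```

The result has one row per id, which is the shape the later pipeline steps operate on.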
[7]:
processed_data = (
    preprocessor.aggregate(groupby_columns='id', agg_params=agg_params)
    .robust(['watched', 'audio'], alpha=0.01, tail='right')
    .cuped('watched', by='audio', transformed_name='watched_cuped')
    .data()
)
ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
                 with alphas = [0.01 0.01]
ambrosia LOGGER:

ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:

ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
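
The robust step logged above can be pictured as a quantile filter. The sketch below is a plain pandas approximation, not Ambrosia's code: it assumes that a right-tail transformation with alpha=0.01 drops every row whose value exceeds the 0.99 quantile in at least one of the listed columns. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a heavy right tail
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'watched': rng.lognormal(mean=8.0, sigma=1.0, size=1000),
    'audio': rng.lognormal(mean=5.0, sigma=0.5, size=1000),
})

alpha = 0.01
columns = ['watched', 'audio']

# Upper bound for each column: its (1 - alpha) quantile
upper_bounds = toy[columns].quantile(1 - alpha)

# Keep only the rows that stay within the bound in every column
keep_mask = (toy[columns] <= upper_bounds).all(axis=1)
toy_trimmed = toy[keep_mask]

print(len(toy), '->', len(toy_trimmed))
print(toy[columns].var() / toy_trimmed[columns].var())
```

Because a row is dropped when either column is extreme, between 1 % and 2 % of rows are removed here. This would be consistent with the tutorial's later output, which shows 4931 rows while the ids run up to 4999.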

Note that the final data() method returns the resulting data frame.

[8]:
processed_data.head()
[8]:
id watched audio gender platform watched_cuped
0 0 2489.224016 213.817130 Male web 5476.097797
1 1 3970.775664 281.958297 Male ios 5402.751034
2 2 5900.186483 416.944150 Female ios 4251.949148
3 3 5557.860998 384.782010 Male web 4643.524511
4 4 7588.374990 448.263748 Female android 5225.462582
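
To make the watched_cuped column less of a black box, here is a minimal numpy sketch of the textbook CUPED adjustment, Y_cuped = Y - theta * (X - mean(X)) with theta = cov(Y, X) / var(X). It assumes Ambrosia follows this standard form, which the tutorial does not confirm in detail. The synthetic arrays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical synthetic metrics: 'audio' is strongly correlated with 'watched'
audio = rng.normal(loc=350.0, scale=120.0, size=5000)
watched = 15.0 * audio + rng.normal(loc=0.0, scale=800.0, size=5000)

# theta minimizes the variance of the adjusted metric
theta = np.cov(watched, audio)[0, 1] / np.var(audio, ddof=1)

# Subtract the part of 'watched' explained by 'audio'; the mean is preserved
watched_cuped = watched - theta * (audio - audio.mean())

variance_ratio = np.var(watched_cuped, ddof=1) / np.var(watched, ddof=1)
correlation = np.corrcoef(watched, audio)[0, 1]

print(f'variance ratio: {variance_ratio:.4f}')
print(f'1 - corr^2:     {1 - correlation ** 2:.4f}')
```

The two printed numbers coincide, because the remaining share of variance equals one minus the squared correlation between the metric and the covariate. Under this formula, the 7.8360 % reported in the log would correspond to a correlation of roughly 0.96 between watched and audio.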

The transformations() method returns a list of all applied transformations. The parameters of these transformations were fitted when the corresponding methods were executed.

[9]:
preprocessor.transformations()
[9]:
[<ambrosia.preprocessing.aggregate.AggregatePreprocessor at 0x12f6a67f0>,
 <ambrosia.preprocessing.robust.RobustPreprocessor at 0x10433abe0>,
 <ambrosia.preprocessing.cuped.Cuped at 0x12f504820>]
For many scenarios, it is useful to store executed transformations with fitted parameters for future use.
For example, we may receive data in regular batches that all need the same transformation, or we may be waiting for an A/B test to finish and need to process its data with the same pre-experimental parameters.

For this, the Preprocessor has two methods that save and load fitted transformations: store_transformations() and transform_from_config().

First, let’s store them

[10]:
store_path = '_examples_configs/preprocessor.json'
[11]:
preprocessor.store_transformations(store_path=store_path)
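
The idea behind storing fitted transformations can be sketched without Ambrosia: fit the parameters once, write them to a JSON file, and later apply them to new data without refitting. The file name, the JSON layout, and the synthetic arrays below are hypothetical and do not reflect the format that store_transformations() actually writes.

```python
import json
import os
import tempfile

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical pre-experiment data used to fit the parameters
audio_pre = rng.normal(350.0, 120.0, size=2000)
watched_pre = 15.0 * audio_pre + rng.normal(0.0, 800.0, size=2000)

fitted = {
    'theta': float(np.cov(watched_pre, audio_pre)[0, 1] / np.var(audio_pre, ddof=1)),
    'covariate_mean': float(audio_pre.mean()),
}

# Store the fitted parameters (hypothetical file name and layout)
config_path = os.path.join(tempfile.mkdtemp(), 'cuped_params.json')
with open(config_path, 'w') as config_file:
    json.dump(fitted, config_file)

# Later: load the parameters and apply them to new data without refitting
with open(config_path) as config_file:
    loaded = json.load(config_file)

audio_new = rng.normal(350.0, 120.0, size=500)
watched_new = 15.0 * audio_new + rng.normal(0.0, 800.0, size=500)
watched_new_cuped = watched_new - loaded['theta'] * (audio_new - loaded['covariate_mean'])
```

Reusing the stored theta and mean keeps the transformation identical across batches, which is the property that matters when pre-experiment parameters are applied to experiment data.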
Now imagine that in the future we would like to process the data using these stored transformations.
For simplicity, we will use the same data

Create a new instance with the data to process

[12]:
future_preprocessor = Preprocessor(dataframe=data)

Pass the path to the stored transformations

[13]:
future_preprocessor.transform_from_config(load_path=store_path)
ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
                 with alphas = [0.01 0.01]
ambrosia LOGGER:

ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:

ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
[13]:
id watched audio gender platform watched_cuped
0 0 2489.224016 213.817130 Male web 5476.097797
1 1 3970.775664 281.958297 Male ios 5402.751034
2 2 5900.186483 416.944150 Female ios 4251.949148
3 3 5557.860998 384.782010 Male web 4643.524511
4 4 7588.374990 448.263748 Female android 5225.462582
... ... ... ... ... ... ...
4995 4995 1647.603060 167.552826 Male web 5690.171914
4996 4996 7403.347846 423.972130 Female android 5594.740581
4997 4997 3243.170373 287.159499 Male android 4556.460653
4998 4998 12538.349029 615.502371 Female ios 6359.254994
4999 4999 2302.644537 213.418724 Female android 5298.609479

4931 rows × 6 columns


Learn more

To learn more about the transformations that can be used in the Preprocessor, their functionality, and their usage, check:

  • Preprocessor class documentation

  • An overview of Ambrosia main data preprocessing tools

  • An overview of advanced metric transformation to learn about different methods for reducing variance