Chain Preprocessor class overview
In this tutorial, we look at the sequential Preprocessor class, whose methods wrap most of the data processing classes implemented in Ambrosia.
To demonstrate the capabilities of the class, we will use synthetic data on the time spent by users on video and audio content.
[2]:
import numpy as np
import pandas as pd
from ambrosia.preprocessing import Preprocessor
Load data
[3]:
data = pd.read_csv('../tests/test_data/pipeline_test.csv')
This is daily per-user data covering a period of one week.
[4]:
data.head()
[4]:
| | id | gender | watched | audio | day | platform |
|---|---|---|---|---|---|---|
| 0 | 0 | Male | 7.912889 | 2.210973 | 1 | web |
| 1 | 1 | Male | 6.678690 | 0.020715 | 1 | ios |
| 2 | 2 | Female | 721.434299 | 59.996870 | 1 | ios |
| 3 | 3 | Male | 135.248218 | 18.982887 | 1 | ios |
| 4 | 4 | Female | 38.962917 | 8.324667 | 1 | android |
The Preprocessor class allows one to create custom sequential pipelines that include the steps of data aggregation, outlier removal, and metric transformation. These pipelines can be saved and loaded from files, making them suitable for ongoing data processing.
Let’s create a class instance and pass the data to it.
[5]:
preprocessor = Preprocessor(dataframe=data, verbose=True)
Now we will apply several preprocessing steps: aggregation, outlier removal, and the CUPED metric transformation for variance reduction.
For almost all of the individual data processing classes in Ambrosia, the Preprocessor class has a corresponding method. Check the class documentation to find out their aliases and capabilities.
[6]:
# Set detailed aggregation parameters
agg_params = {
    'watched': 'sum',
    'audio': 'sum',
    'gender': 'simple',  # simple - choose the first possible value
    'platform': 'mode',
}
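As a rough illustration of what these parameters ask for, the sketch below performs a similar per-user aggregation in plain pandas on a tiny made-up frame. It is not Ambrosia's implementation: mapping 'simple' to "first value in the group" follows the comment above, and reading 'mode' as "most frequent value" is an assumption. The frame `toy` and the helper `most_frequent` are hypothetical names.

```python
import pandas as pd

# Tiny made-up daily log: two users, three days each (hypothetical values).
toy = pd.DataFrame({
    "id":       [0, 0, 0, 1, 1, 1],
    "gender":   ["Male", "Male", "Male", "Female", "Female", "Female"],
    "watched":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "audio":    [0.5, 0.5, 1.0, 2.0, 2.0, 2.0],
    "platform": ["web", "ios", "ios", "android", "android", "web"],
})


def most_frequent(values: pd.Series):
    """Return the most frequent value in a group (smallest in sort order on ties)."""
    return values.mode().iloc[0]


# One row per user: sum the metrics, keep the first gender, take the modal platform.
toy_aggregated = (
    toy.groupby("id")
    .agg(
        watched=("watched", "sum"),
        audio=("audio", "sum"),
        gender=("gender", "first"),
        platform=("platform", most_frequent),
    )
    .reset_index()
)
print(toy_aggregated)
```

The result has one row per `id`, which is the shape the later steps of the pipeline operate on.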
[7]:
processed_data = (
    preprocessor
    .aggregate(groupby_columns='id', agg_params=agg_params)
    .robust(['watched', 'audio'], alpha=0.01, tail='right')
    .cuped('watched', by='audio', transformed_name='watched_cuped')
    .data()
)
ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
with alphas = [0.01 0.01]
ambrosia LOGGER:
ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:
ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
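The log above reports a right-tail robust transformation with alpha = 0.01. A common way to implement such a step is to drop the rows whose value exceeds the (1 - alpha) empirical quantile. The sketch below shows that idea with NumPy on synthetic data. Whether Ambrosia computes its threshold exactly this way is an assumption, and `drop_right_tail` is a hypothetical name.

```python
import numpy as np


def drop_right_tail(values: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Keep only values at or below the (1 - alpha) empirical quantile."""
    threshold = np.quantile(values, 1.0 - alpha)
    return values[values <= threshold]


rng = np.random.default_rng(0)
# Heavy-tailed synthetic metric (hypothetical data, not the tutorial dataset).
metric = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
trimmed = drop_right_tail(metric, alpha=0.01)

print(f"rows: {metric.size} -> {trimmed.size}")
print(f"variance: {metric.var():.2f} -> {trimmed.var():.2f}")
```

Removing only about 1% of the rows can shrink the variance and range noticeably when the metric is heavy-tailed, which matches the direction of the changes in the log.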
Note that the final data() call returns the resulting data frame.
[8]:
processed_data.head()
[8]:
| | id | watched | audio | gender | platform | watched_cuped |
|---|---|---|---|---|---|---|
| 0 | 0 | 2489.224016 | 213.817130 | Male | web | 5476.097797 |
| 1 | 1 | 3970.775664 | 281.958297 | Male | ios | 5402.751034 |
| 2 | 2 | 5900.186483 | 416.944150 | Female | ios | 4251.949148 |
| 3 | 3 | 5557.860998 | 384.782010 | Male | web | 4643.524511 |
| 4 | 4 | 7588.374990 | 448.263748 | Female | android | 5225.462582 |
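For intuition about the watched_cuped column: the textbook CUPED adjustment subtracts from the metric the part that a covariate explains linearly, Y_cuped = Y - theta * (X - mean(X)) with theta = cov(Y, X) / var(X). The sketch below applies that formula with NumPy to synthetic data. It is not taken from Ambrosia's source, the library's exact estimator may differ, and `cuped_adjust` is a hypothetical name.

```python
import numpy as np


def cuped_adjust(metric: np.ndarray, covariate: np.ndarray) -> np.ndarray:
    """Textbook CUPED: remove the linear effect of a centred covariate."""
    theta = np.cov(metric, covariate, ddof=1)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())


rng = np.random.default_rng(42)
# Synthetic covariate and a metric that depends on it linearly plus noise.
covariate = rng.normal(loc=350.0, scale=120.0, size=5_000)
metric = 20.0 * covariate + rng.normal(scale=800.0, size=5_000)

adjusted = cuped_adjust(metric, covariate)
ratio = adjusted.var(ddof=1) / metric.var(ddof=1)
rho = np.corrcoef(metric, covariate)[0, 1]

# The mean is preserved, and the variance shrinks by a factor of 1 - rho^2.
print(f"mean: {metric.mean():.2f} -> {adjusted.mean():.2f}")
print(f"variance ratio: {ratio:.4f}, 1 - rho^2: {1 - rho**2:.4f}")
```

Under this formula, the stronger the correlation between the metric and the covariate, the smaller the remaining variance.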
The transformations() method returns a list of all applied transformations. Their parameters were fitted when the corresponding methods were executed.
[9]:
preprocessor.transformations()
[9]:
[<ambrosia.preprocessing.aggregate.AggregatePreprocessor at 0x12f6a67f0>,
<ambrosia.preprocessing.robust.RobustPreprocessor at 0x10433abe0>,
<ambrosia.preprocessing.cuped.Cuped at 0x12f504820>]
To reuse the fitted transformations later, the Preprocessor has two methods: store_transformations() saves them to a file, and transform_from_config() loads and applies them.
First, let’s store them.
[10]:
store_path = '_examples_configs/preprocessor.json'
[11]:
preprocessor.store_transformations(store_path=store_path)
Create a new instance with the data to process.
[12]:
future_preprocessor = Preprocessor(dataframe=data)
Pass the path to the stored transformations.
[13]:
future_preprocessor.transform_from_config(load_path=store_path)
ambrosia LOGGER: Making right-tail robust transformation of columns ['watched', 'audio']
with alphas = [0.01 0.01]
ambrosia LOGGER:
ambrosia LOGGER: Change Mean watched: 5343.8899 ===> 5170.2892
ambrosia LOGGER: Change Variance watched: 10951522.1717 ===> 8739833.1681
ambrosia LOGGER: Change IQR watched: 3958.8107 ===> 3856.7420
ambrosia LOGGER: Change Range watched: 35983.1570 ===> 15681.7113
ambrosia LOGGER:
ambrosia LOGGER: Change Mean audio: 350.3962 ===> 344.7125
ambrosia LOGGER: Change Variance audio: 17724.3973 ===> 15469.6160
ambrosia LOGGER: Change IQR audio: 176.0167 ===> 172.6091
ambrosia LOGGER: Change Range audio: 1098.9677 ===> 683.7463
ambrosia LOGGER: After transformation СUPED for watched, the variance is 7.8360 % of the original
ambrosia LOGGER: Variance transformation 8739833.1681 ===> 684853.6668
[13]:
| | id | watched | audio | gender | platform | watched_cuped |
|---|---|---|---|---|---|---|
| 0 | 0 | 2489.224016 | 213.817130 | Male | web | 5476.097797 |
| 1 | 1 | 3970.775664 | 281.958297 | Male | ios | 5402.751034 |
| 2 | 2 | 5900.186483 | 416.944150 | Female | ios | 4251.949148 |
| 3 | 3 | 5557.860998 | 384.782010 | Male | web | 4643.524511 |
| 4 | 4 | 7588.374990 | 448.263748 | Female | android | 5225.462582 |
| ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4995 | 1647.603060 | 167.552826 | Male | web | 5690.171914 |
| 4996 | 4996 | 7403.347846 | 423.972130 | Female | android | 5594.740581 |
| 4997 | 4997 | 3243.170373 | 287.159499 | Male | android | 4556.460653 |
| 4998 | 4998 | 12538.349029 | 615.502371 | Female | ios | 6359.254994 |
| 4999 | 4999 | 2302.644537 | 213.418724 | Female | android | 5298.609479 |
4931 rows × 6 columns
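The general pattern here is: fit once, store the fitted parameters, then reapply them later without refitting. The sketch below illustrates that pattern with the standard json module and a single quantile threshold. The file layout and the function names are hypothetical and do not reflect the format Ambrosia writes to preprocessor.json.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def fit_threshold(values: np.ndarray, alpha: float) -> dict:
    """Fit a right-tail cut-off and return it as JSON-serialisable parameters."""
    return {"alpha": alpha, "threshold": float(np.quantile(values, 1.0 - alpha))}


def apply_threshold(values: np.ndarray, params: dict) -> np.ndarray:
    """Apply a previously fitted cut-off without refitting it."""
    return values[values <= params["threshold"]]


rng = np.random.default_rng(1)
history = rng.exponential(scale=100.0, size=5_000)    # data used for fitting
new_batch = rng.exponential(scale=100.0, size=1_000)  # data arriving later

with tempfile.TemporaryDirectory() as tmp_dir:
    # Store the fitted parameters, then read them back as a later job would.
    config_path = Path(tmp_dir) / "toy_config.json"
    config_path.write_text(json.dumps(fit_threshold(history, alpha=0.01)))
    loaded = json.loads(config_path.read_text())

filtered = apply_threshold(new_batch, loaded)
print(f"rows: {new_batch.size} -> {filtered.size}")
```

Because the threshold comes from the stored file, the new batch is filtered with exactly the cut-off that was fitted on the historical data.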
Learn more
To learn more about the transformations that can be used in the Preprocessor, their functionality, and usage, check:
Preprocessor class documentation
An overview of Ambrosia main data preprocessing tools
An overview of advanced metric transformations, to learn about different methods for reducing variance