Preprocessor

class ambrosia.preprocessing.Preprocessor(dataframe, verbose=True)

Preprocessor class whose implementation is based on the chain pattern: each transformation method returns the instance, so calls can be chained.

Parameters:
dataframe : pd.DataFrame

Table with data used for further transformations.

verbose : bool, default: True

If True, prints information about the variance reduction to sys.stdout.

Attributes:
dataframe : pd.DataFrame

Table with data for transformations.

transformers : List of transformations

List of transformations that have been called so far.

verbose : bool

Verbose info flag.

Methods

data(copy=True)

Returns a copy of the stored dataframe or a reference to it.

aggregate(groupby_columns, categorial_method, real_method, agg_params, real_cols, categorial_cols)

Aggregate data by columns.

robust(column_names, alpha=0.05)

Remove outliers from the tails of the selected columns.

iqr(column_names)

Remove outliers from the selected columns using the IQR rule.

boxcox(column_names)

Apply a Box-Cox transformation.

log(column_names)

Apply a log transformation.

cuped(target, by, transformed_name, load_path)

Apply the CUPED transformation to the stored dataframe.

multicuped(target, by, name, load_path)

Apply the Multi CUPED transformation to the stored dataframe.

transformations()

Returns a list of transformations.

store_transformations(store_path)

Store transformations in a JSON file.

load_transformations(load_path)

Load transformations from a JSON file.

apply_transformations()

Apply transformations to the stored dataframe.

transform_from_config(load_path)

Transform the inner dataframe using a pre-saved config file.

Examples

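>>> # aggregate_params, robust_params and cuped_params are assumed to be
>>> # dicts of keyword arguments for the corresponding methods.
>>> transformer = Preprocessor(dataframe)
>>> transformed = (
...     transformer.aggregate(**aggregate_params)
...     .robust(**robust_params)
...     .cuped(**cuped_params)
...     .data()
... )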
data(copy=True)

Return the inner dataframe.

Use after all transformations to get the transformed data.

Parameters:
copy : bool, default: True

If True, returns a copy; otherwise returns a reference to the stored dataframe.

Returns:
dataframe : pd.DataFrame

Table with the modified data after the sequential preprocessing.
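
The chain pattern and the copy flag of data() can be illustrated with a small standalone sketch. MiniPreprocessor, drop_negative, scale and copy_data are hypothetical names invented for this illustration and are not part of ambrosia; the sketch only shows the idea of returning self from every step and handing out a copy at the end.

```python
import copy


class MiniPreprocessor:
    """Hypothetical chainable preprocessor over a list of row dicts (not ambrosia code)."""

    def __init__(self, rows):
        # Deep copy so the caller's rows are never modified by the steps below.
        self._rows = copy.deepcopy(list(rows))
        self.transformers = []  # record of the steps applied so far

    def drop_negative(self, column):
        # Each step updates the stored data, records itself, and returns self.
        self._rows = [row for row in self._rows if row[column] >= 0]
        self.transformers.append(("drop_negative", column))
        return self

    def scale(self, column, factor):
        for row in self._rows:
            row[column] = row[column] * factor
        self.transformers.append(("scale", column))
        return self

    def data(self, copy_data=True):
        # A copy protects the internal state; a reference avoids the copy cost.
        return copy.deepcopy(self._rows) if copy_data else self._rows


rows = [{"metric": 2.0}, {"metric": -1.0}, {"metric": 5.0}]
result = MiniPreprocessor(rows).drop_negative("metric").scale("metric", 10).data()
print(result)  # [{'metric': 20.0}, {'metric': 50.0}]
```

Because every step returns the instance, the order of the calls is the order of the transformations, and the list of recorded steps can later be stored or replayed.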

aggregate(groupby_columns=None, categorial_method='mode', real_method='sum', agg_params=None, real_cols=None, categorial_cols=None, load_path=None)

Aggregate the dataframe.

Parameters:
groupby_columns : List of columns, optional

Columns for GROUP BY.

categorial_method : types.MethodType, default: "mode"

Aggregation method applied to all selected categorical variables.

real_method : types.MethodType, default: "sum"

Aggregation method applied to all selected real variables.

agg_params : Dict, optional

Dictionary with aggregation parameters.

real_cols : types.ColumnNamesType, optional

Columns with real metrics. Overridden by agg_params; pass it when the default aggregation behavior (real_method) is what you want.

categorial_cols : types.ColumnNamesType, optional

Columns with categorical metrics. Overridden by agg_params; pass it when the default aggregation behavior (categorial_method) is what you want.

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object
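
As a rough illustration of the default behavior described above (sum for real columns, mode for categorical ones), here is a standalone sketch using only the standard library. The function aggregate_rows and the sample column names are hypothetical; ambrosia itself operates on a pd.DataFrame and its internals may differ.

```python
from collections import defaultdict
from statistics import mode


def aggregate_rows(rows, groupby, real_cols, categorial_cols):
    """Group row dicts by one key; sum real columns, take the mode of categorical ones."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)

    result = []
    for key, members in groups.items():
        out = {groupby: key}
        for col in real_cols:
            out[col] = sum(member[col] for member in members)
        for col in categorial_cols:
            out[col] = mode(member[col] for member in members)
        result.append(out)
    return result


rows = [
    {"user": "a", "spend": 10.0, "platform": "ios"},
    {"user": "a", "spend": 5.0, "platform": "ios"},
    {"user": "a", "spend": 1.0, "platform": "web"},
    {"user": "b", "spend": 7.0, "platform": "web"},
]
aggregated = aggregate_rows(rows, "user", ["spend"], ["platform"])
print(aggregated)
```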

robust(column_names=None, alpha=0.05, tail='both', load_path=None)

Make a robust preprocessing of the selected columns to remove outliers.

Removes rows from the dataframe that fall in the left tail, the right tail, or both tails of the selected metrics' distributions.

Parameters:
column_names : ColumnNamesType

One or more columns in the dataframe.

alpha : Union[float, np.ndarray], default: 0.05

The fraction of data to remove from the selected tail(s).

tail : str, default: "both"

Part of the distribution to be removed. Can be "left", "right" or "both".

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object
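
The tail trimming described above can be sketched as a quantile filter. This is an illustration under stated assumptions, not ambrosia's implementation: the helper names are hypothetical, quantiles use linear interpolation, and for tail="both" the sketch assumes alpha is split evenly between the two tails, which should be checked against the library's actual convention.

```python
def quantile(values, q):
    """Quantile for q in [0, 1] using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def robust_filter(values, alpha=0.05, tail="both"):
    """Keep only the values inside the quantile bounds for the chosen tail(s)."""
    if tail not in ("left", "right", "both"):
        raise ValueError("tail must be 'left', 'right' or 'both'")
    low, high = float("-inf"), float("inf")
    if tail == "both":
        # Assumption: alpha is shared equally by the two tails.
        low = quantile(values, alpha / 2)
        high = quantile(values, 1 - alpha / 2)
    elif tail == "left":
        low = quantile(values, alpha)
    else:
        high = quantile(values, 1 - alpha)
    return [v for v in values if low <= v <= high]


values = list(range(1, 101))
trimmed_both = robust_filter(values, alpha=0.1, tail="both")
trimmed_right = robust_filter(values, alpha=0.1, tail="right")
print(len(trimmed_both), min(trimmed_both), max(trimmed_both))
```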

iqr(column_names=None, load_path=None)

Make an IQR preprocessing of the selected columns to remove outliers.

Removes rows from the dataframe that lie beyond the boxplot maximum and minimum of the selected metrics' distributions.

Parameters:
column_names : ColumnNamesType, optional

One or more columns in the dataframe.

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object
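
A minimal sketch of the boxplot rule, assuming the conventional whiskers at 1.5 interquartile ranges beyond the first and third quartiles. The multiplier, the quantile method and the function name iqr_filter are assumptions for illustration; the source does not state which values ambrosia uses.

```python
from statistics import quantiles


def iqr_filter(values, whisker=1.5):
    """Drop values outside [Q1 - whisker * IQR, Q3 + whisker * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low = q1 - whisker * spread
    high = q3 + whisker * spread
    return [v for v in values if low <= v <= high]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
filtered = iqr_filter(values)
print(filtered)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Unlike the quantile trimming in robust, this rule removes nothing when the data contains no extreme points.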

boxcox(column_names=None, load_path=None)

Make a Box-Cox transformation on the selected columns.

Optimal transformation parameters are selected automatically.

Parameters:
column_names : ColumnNamesType, optional

One or more columns in the dataframe.

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object
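
For reference, the Box-Cox transform maps a positive value x to (x^lambda - 1) / lambda, or to log(x) when lambda is 0. The sketch below picks lambda by maximizing the usual profile log-likelihood over a coarse grid. The grid search and all function names are illustrative assumptions; the source only says that the parameters are selected automatically, not how.

```python
import math


def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda under a normal model for the transformed data."""
    transformed = boxcox(values, lam)
    n = len(values)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    log_sum = sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + (lam - 1) * log_sum


def best_lambda(values, grid=None):
    """Pick the grid value of lambda with the highest log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))


values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
chosen = best_lambda(values)
transformed = boxcox(values, chosen)
```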

log(column_names=None, load_path=None)

Make a logarithmic transformation on the selected columns.

Parameters:
column_names : ColumnNamesType, optional

One or more columns in the dataframe.

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object

cuped(target=None, by=None, transformed_name=None, load_path=None)

Make CUPED transformation on the selected column.

Parameters:
target : ColumnNameType

Column from the dataframe to which the CUPED transformation will be applied.

by : ColumnNameType

Covariate column in the dataframe.

transformed_name : types.ColumnNameType, optional

Name for the new transformed target column; if not defined, it is generated automatically.

load_path : Path, optional

Path to a JSON file with parameters.

Returns:
self : Preprocessor

Instance object
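
The idea behind CUPED can be sketched in a few lines: the target is adjusted by a covariate (typically the same metric measured before the experiment) scaled by theta = cov(target, covariate) / var(covariate). The version below uses the common mean-centered form, which keeps the mean of the target unchanged; whether ambrosia centers the covariate in the same way is an assumption, and the function name and sample data are hypothetical.

```python
from statistics import mean, pvariance


def cuped_adjust(target, covariate):
    """Return target - theta * (covariate - mean(covariate))."""
    n = len(target)
    mean_y = sum(target) / n
    mean_x = sum(covariate) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, target)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in covariate) / (n - 1)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, target)]


pre_period = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0]   # covariate ("by" column)
post_period = [11.0, 13.0, 9.0, 17.0, 12.0, 14.0]  # target column
adjusted = cuped_adjust(post_period, pre_period)

print(mean(post_period), mean(adjusted))            # means agree up to rounding
print(pvariance(post_period), pvariance(adjusted))  # adjusted variance is smaller
```

The stronger the correlation between target and covariate, the larger the variance reduction.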

transformations()

List of all transformations that have been called.

Returns:
transformers : List[object]

List of executed transformations.

store_transformations(store_path)

Store transformations with their parameters in a JSON file.

Parameters:
store_path : Path

Path to a JSON file where transformations will be stored.

load_transformations(load_path)

Load pre-saved transformations from a JSON file.

Parameters:
load_path : Path

Path to a json file where transformations are stored
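
The store and load steps amount to a JSON round trip of the fitted transformation parameters. The sketch below shows the idea with a made-up schema; the keys, the parameter values and the helper names store and load are hypothetical and do not reflect the file format ambrosia actually writes.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical description of two fitted steps (not ambrosia's real schema).
transformations = [
    {"name": "robust", "params": {"column_names": ["spend"], "alpha": 0.05, "tail": "both"}},
    {"name": "cuped", "params": {"target": "spend", "by": "spend_before", "theta": 0.8}},
]


def store(steps, store_path):
    """Write the list of steps to a JSON file."""
    Path(store_path).write_text(json.dumps(steps, indent=2), encoding="utf-8")


def load(load_path):
    """Read the list of steps back from a JSON file."""
    return json.loads(Path(load_path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "transformations.json"
    store(transformations, config_path)
    restored = load(config_path)

print(restored == transformations)  # True
```

Storing fitted parameters this way lets the same preprocessing be replayed on another dataframe without refitting.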

apply_transformations()

Apply all transformations to the inner dataframe.

Returns:
dataframe : pd.DataFrame

Transformed inner dataframe.

transform_from_config(load_path)

Run transformations from the config file on the inner dataframe.

Parameters:
load_path : Path

Path to a JSON file where transformations are stored.

Returns:
dataframe : pd.DataFrame

Transformed inner dataframe.