Preprocessor

Preprocessor class. The implementation is based on the chain pattern: each transformation method returns the instance, so calls can be chained.

class ambrosia.preprocessing.Preprocessor(dataframe, verbose=True)

Parameters:

dataframe (pd.DataFrame): Table with data used for further transformations.
verbose (bool, default: True): If True, prints information about the variance reduction to sys.stdout.

Attributes:

dataframe (pd.DataFrame): Table with data for transformations.
transformers (list of transformations): Transformations that have been called so far.
verbose (bool): Verbose info flag.

Methods

data(copy=True)	Return a copy of, or a reference to, the stored dataframe.
aggregate(groupby_columns, categorial_method, real_method, agg_params, real_cols, categorial_cols)	Aggregate data by columns.
robust(column_names, alpha=0.05)	Apply robust preprocessing to the data.
iqr(column_names)	Apply IQR preprocessing to the data.
boxcox(column_names)	Apply a Box-Cox transformation.
log(column_names)	Apply a log transformation.
cuped(target, by, transformed_name, load_path)	Apply the CUPED transformation to the stored dataframe.
multicuped(target, by, name, load_path)	Apply the Multi CUPED transformation to the stored dataframe.
transformations()	Return the list of transformations.
store_transformations(store_path)	Store transformations in a JSON file.
load_transformations(load_path)	Load transformations from a JSON file.
apply_transformations()	Apply transformations to the stored dataframe.
transform_from_config(load_path)	Transform the inner data frame using a pre-saved config file.

Examples

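>>> # Assumption: aggregate_params, robust_params and cuped_params are dicts of keyword arguments.
>>> transformer = Preprocessor(dataframe)
>>> transformed = (
...     transformer.aggregate(**aggregate_params)
...     .robust(**robust_params)
...     .cuped(**cuped_params)
...     .data()
... )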
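
The chaining in the example above works because each transformation method returns the instance itself. A minimal sketch of that pattern follows. It uses a hypothetical `ToyChain` class that is not part of ambrosia and operates on a plain list instead of a dataframe; the method names `drop_negative` and `scale` are invented for illustration.

```python
class ToyChain:
    """Hypothetical stand-in for a chainable preprocessor (not ambrosia code)."""

    def __init__(self, values):
        self._values = list(values)
        self.steps = []  # names of the steps called so far

    def drop_negative(self):
        self._values = [v for v in self._values if v >= 0]
        self.steps.append("drop_negative")
        return self  # returning self is what enables chaining

    def scale(self, factor):
        self._values = [v * factor for v in self._values]
        self.steps.append("scale")
        return self

    def data(self, copy=True):
        # A copy protects the internal state from later mutation by the caller.
        return list(self._values) if copy else self._values


chain = ToyChain([3, -1, 2])
result = chain.drop_negative().scale(10).data()
```

Because `data()` returns plain data instead of the instance, it naturally ends the chain, which mirrors how `data()` is used as the last call in the example above.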

data(copy=True)

Return the inner data frame.

Use after all transformations to get transformed data.

Parameters:

copy (bool, default: True): If True, returns a copy; otherwise returns a reference to the stored dataframe.

Returns:

dataframe (pd.DataFrame): Table with the modified data after the sequential preprocessing.

aggregate(groupby_columns=None, categorial_method='mode', real_method='sum', agg_params=None, real_cols=None, categorial_cols=None, load_path=None)

Make an aggregation of the dataframe.

Parameters:

groupby_columns (List of columns, optional): Columns for GROUP BY.
categorial_method (types.MethodType, default: "mode"): Aggregation method applied to all selected categorical variables.
real_method (types.MethodType, default: "sum"): Aggregation method applied to all selected real variables.
agg_params (Dict, optional): Dictionary with aggregation parameters.
real_cols (types.ColumnNamesType, optional): Columns with real metrics. Overridden by agg_params; pass it when the default aggregation behavior is expected.
categorial_cols (types.ColumnNamesType, optional): Columns with categorical metrics. Overridden by agg_params; pass it when the default aggregation behavior is expected.
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object
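
For intuition, the default behavior described above ("sum" for real columns, "mode" for categorical columns) can be approximated in plain pandas. This is a sketch, not the library's implementation: the column names are invented, and taking the first value when the mode is tied is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 2],
        "spend": [10.0, 5.0, 1.0, 2.0, 3.0],
        "platform": ["ios", "ios", "web", "android", "web"],
    }
)


def first_mode(series):
    # Series.mode() may return several values on ties; keep the first one (assumption).
    return series.mode().iloc[0]


aggregated = (
    df.groupby("user_id")
    .agg({"spend": "sum", "platform": first_mode})
    .reset_index()
)
```

Each `user_id` ends up as a single row, with `spend` summed and `platform` reduced to its most frequent value.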

robust(column_names=None, alpha=0.05, tail='both', load_path=None)

Make a robust preprocessing of the selected columns to remove outliers.

Removes rows from the dataframe that fall in the left tail, the right tail, or both tails of the selected metrics' distributions.

Parameters:

column_names (ColumnNamesType): One or more columns in the dataframe.
alpha (Union[float, np.ndarray], default: 0.05): Fraction of the data to remove from the head and/or tail of the distribution, as selected by tail.
tail (str, default: "both"): Part of the distribution to be removed. Can be "left", "right" or "both".
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object
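
The idea of quantile-based tail removal can be sketched with pandas alone. The snippet below is illustrative and not the library's code; in particular, splitting `alpha` evenly between the two tails for the "both" case is an assumption, as are the column name and the synthetic data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"metric": rng.exponential(scale=10.0, size=1000)})

alpha = 0.05
# Assumption: for tail="both" the share alpha is split evenly between the tails.
lower = df["metric"].quantile(alpha / 2)
upper = df["metric"].quantile(1 - alpha / 2)

# Keep only the rows inside the [lower, upper] band.
trimmed = df[(df["metric"] >= lower) & (df["metric"] <= upper)]
```

For a one-sided variant, only one of the two bounds would be applied, with the whole of `alpha` assigned to that side.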

iqr(column_names=None, load_path=None)

Make an IQR preprocessing of the selected columns to remove outliers.

Removes rows from the dataframe that lie beyond the boxplot maximum and minimum of the selected metrics' distributions.

Parameters:

column_names (ColumnNamesType, optional): One or more columns in the dataframe.
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object
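
A boxplot-style filter can be sketched as follows. The multiplier 1.5 is the conventional Tukey choice and is assumed here; the page does not state which multiplier the library uses, and the data is invented.

```python
import pandas as pd

df = pd.DataFrame({"metric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]})

q1 = df["metric"].quantile(0.25)
q3 = df["metric"].quantile(0.75)
iqr = q3 - q1

# Assumption: the conventional Tukey multiplier of 1.5 defines the whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows outside [lower, upper] are treated as outliers and dropped.
filtered = df[df["metric"].between(lower, upper)]
```

Unlike the quantile-based approach, the number of removed rows is not fixed in advance: it depends on how far the extreme values sit from the quartiles.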

boxcox(column_names=None, load_path=None)

Make a Box-Cox transformation on the selected columns.

Optimal transformation parameters are selected automatically.

Parameters:

column_names (ColumnNamesType, optional): One or more columns in the dataframe.
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object
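
To illustrate what automatic parameter selection can look like, the sketch below uses `scipy.stats.boxcox`, which estimates the Box-Cox parameter by maximum likelihood when it is not supplied. Whether ambrosia relies on SciPy or on this estimator is not stated on this page, so treat it as an assumption. Note that Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Strictly positive, right-skewed synthetic data.
values = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# With lmbda omitted, scipy estimates it by maximum likelihood and returns it.
transformed, fitted_lambda = stats.boxcox(values)

skew_before = stats.skew(values)
skew_after = stats.skew(transformed)
```

Comparing `skew_before` with `skew_after` gives a quick check that the transformed column is closer to symmetric.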

log(column_names=None, load_path=None)

Make a logarithmic transformation on the selected columns.

Parameters:

column_names (ColumnNamesType, optional): One or more columns in the dataframe.
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object

cuped(target=None, by=None, transformed_name=None, load_path=None)

Make CUPED transformation on the selected column.

Parameters:

target (ColumnNameType): Column in the dataframe to which the CUPED transformation will be applied.
by (ColumnNameType): Covariate column in the dataframe.
transformed_name (types.ColumnNameType, optional): Name for the new transformed target column. If not given, it is generated automatically.
load_path (Path, optional): Path to a JSON file with parameters.

Returns:

self (Preprocessor): Instance object
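
For reference, the standard single-covariate CUPED adjustment is `target - theta * (covariate - mean(covariate))` with `theta = cov(target, covariate) / var(covariate)`. The NumPy sketch below shows that textbook formula on synthetic data; it is not taken from the library's source, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pre-experiment metric (the covariate) and the correlated experiment metric.
covariate = rng.normal(loc=100.0, scale=10.0, size=2000)
target = covariate + rng.normal(loc=5.0, scale=5.0, size=2000)

# theta minimises the variance of the adjusted metric.
theta = np.cov(target, covariate)[0, 1] / np.var(covariate, ddof=1)

# Centering the covariate keeps the mean of the target unchanged.
target_cuped = target - theta * (covariate - covariate.mean())
```

The adjusted metric keeps the same mean as `target` but has a lower variance, and the reduction grows with the correlation between the target and the covariate.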

transformations()

Return the list of all transformations that have been called.

Returns:

transformers (List[object]): List of executed transformations.

store_transformations(store_path)

Store the transformations and their parameters in a JSON file.

Parameters:

store_path (Path): Path to a JSON file where transformations will be stored.

load_transformations(load_path)

Load pre-saved transformations from a JSON file.

Parameters:

load_path (Path): Path to a JSON file where transformations are stored.

apply_transformations()

Apply all transformations to the inner data frame.

Returns:

dataframe (pd.DataFrame): Transformed inner data frame.

transform_from_config(load_path)

Run transformations from the config file on the internal data frame.

Parameters:

load_path (Path): Path to a JSON file where transformations are stored.

Returns:

dataframe (pd.DataFrame): Transformed inner data frame.
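
The store, load and apply cycle described by the last four methods can be pictured with a small standard-library sketch. Everything here is hypothetical: the JSON layout, the step names in `STEPS`, and the `apply_steps` helper are invented for illustration and do not reflect the config schema that ambrosia actually writes.

```python
import json
import math
import tempfile
from pathlib import Path

# Hypothetical registry of named steps; the real config schema may differ.
STEPS = {
    "clip_upper": lambda values, limit: [min(v, limit) for v in values],
    "log": lambda values: [math.log(v) for v in values],
}

# A pipeline is an ordered list of step names with their parameters.
pipeline = [
    {"name": "clip_upper", "params": {"limit": 100.0}},
    {"name": "log", "params": {}},
]


def apply_steps(values, steps):
    """Replay the steps in order on a list of values."""
    for step in steps:
        values = STEPS[step["name"]](values, **step["params"])
    return values


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "transformations.json"
    config_path.write_text(json.dumps(pipeline, indent=2))  # store
    restored = json.loads(config_path.read_text())  # load

replayed = apply_steps([1.0, 50.0, 1000.0], restored)
```

Persisting the fitted steps this way lets the same preprocessing be replayed on new data without refitting, which is the use case the config-based methods above address.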