Preprocessor
Preprocessor class, implementation is based on the chain pattern.
- class ambrosia.preprocessing.Preprocessor(dataframe, verbose=True)
Each transformation method returns the instance itself, so calls can be chained one after another.
- Parameters:
- dataframe : pd.DataFrame
Table with data used for further transformations.
- verbose : bool, default: True
If True, prints information about the variance reduction to sys.stdout.
- Attributes:
- dataframe : pd.DataFrame
Table with data for transformations.
- transformers : List of transformations
List of transformations that have been called before.
- verbose : bool
Verbose info flag.
Methods
data(copy=True)
Returns a copy of, or a reference to, the stored dataframe.
aggregate(groupby_columns, categorial_method, real_method, agg_params, real_cols, categorial_cols)
Aggregate data by columns.
robust(column_names, alpha=0.05)
Apply robust preprocessing to the data.
iqr(column_names)
Apply IQR preprocessing to the data.
boxcox(column_names)
Apply a Box-Cox transformation.
log(column_names)
Apply a log transformation.
cuped(target, by, transformed_name, load_path)
Apply the CUPED transformation to the stored dataframe.
multicuped(target, by, name, load_path)
Apply the Multi CUPED transformation to the stored dataframe.
transformations()
Returns a list of transformations.
store_transformations(store_path)
Store transformations in a json file.
load_transformations(load_path)
Load transformations from a json file.
apply_transformations()
Apply transformations to the stored dataframe.
transform_from_config(load_path)
Transform the inner dataframe using a pre-saved config file.
Examples
>>> transformer = Preprocessor(dataframe)
>>> (
...     transformer.aggregate(aggregate_params)
...     .robust(robust_params)
...     .cuped(cuped_params)
...     .data()
... )
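The chain pattern mentioned above can be illustrated with a minimal, standalone sketch. It is not ambrosia code: `ChainSketch`, `drop_negative` and `scale` are hypothetical names, and a plain list stands in for the dataframe. The only point is that every step returns `self`, which is what lets the calls be strung together.

```python
class ChainSketch:
    """Hypothetical stand-in for a chainable preprocessor (not ambrosia code)."""

    def __init__(self, values):
        self._values = list(values)
        self.steps = []  # names of the steps called so far

    def drop_negative(self):
        self._values = [v for v in self._values if v >= 0]
        self.steps.append("drop_negative")
        return self  # returning self is what makes chaining possible

    def scale(self, factor):
        self._values = [v * factor for v in self._values]
        self.steps.append("scale")
        return self

    def data(self, copy=True):
        # Return a copy by default so callers cannot mutate internal state.
        return list(self._values) if copy else self._values


result = ChainSketch([3, -1, 2]).drop_negative().scale(10).data()
```

Because each intermediate call returns the same object, the list of recorded steps is available after the chain has run, much like the `transformers` attribute described above.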
- data(copy=True)
Return the inner dataframe.
Use after all transformations to get the transformed data.
- Parameters:
- copy : bool, default: True
If True, returns a copy; otherwise returns a reference to the stored dataframe.
- Returns:
- dataframe : pd.DataFrame
Table with the modified data after the sequential preprocessing.
- aggregate(groupby_columns=None, categorial_method='mode', real_method='sum', agg_params=None, real_cols=None, categorial_cols=None, load_path=None)
Make an aggregation of the dataframe.
- Parameters:
- groupby_columns : List of columns, optional
Columns for GROUP BY.
- categorial_method : types.MethodType, default: "mode"
Aggregation method applied to all selected categorical variables.
- real_method : types.MethodType, default: "sum"
Aggregation method applied to all selected real variables.
- agg_params : Dict, optional
Dictionary with aggregation parameters.
- real_cols : types.ColumnNamesType, optional
Columns with real metrics. Overridden by the agg_params parameter; pass it when the default aggregation behavior is expected.
- categorial_cols : types.ColumnNamesType, optional
Columns with categorical metrics. Overridden by the agg_params parameter; pass it when the default aggregation behavior is expected.
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
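To make the default behavior concrete, here is a rough standalone sketch of a GROUP BY that sums real columns and takes the mode of categorical ones. It uses plain dictionaries rather than a dataframe, and `aggregate_rows` with its arguments is a hypothetical helper written for illustration, not the library's implementation.

```python
from collections import defaultdict
from statistics import mode

rows = [
    {"user": "a", "spend": 10.0, "city": "Paris"},
    {"user": "a", "spend": 5.0, "city": "Paris"},
    {"user": "a", "spend": 1.0, "city": "Rome"},
    {"user": "b", "spend": 7.0, "city": "Oslo"},
]


def aggregate_rows(rows, groupby, real_cols, categorial_cols):
    """Group rows by one key; sum real columns, take the mode of categorical ones."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)

    result = []
    for key, members in groups.items():
        out = {groupby: key}
        for col in real_cols:
            out[col] = sum(m[col] for m in members)  # mirrors real_method="sum"
        for col in categorial_cols:
            out[col] = mode(m[col] for m in members)  # mirrors categorial_method="mode"
        result.append(out)
    return result


aggregated = aggregate_rows(rows, "user", ["spend"], ["city"])
```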
- robust(column_names=None, alpha=0.05, tail='both', load_path=None)
Make a robust preprocessing of the selected columns to remove outliers.
Removes rows from the dataframe that fall in the left tail, the right tail, or both tails of the selected metrics' distributions.
- Parameters:
- column_names : ColumnNamesType
One or more columns in the dataframe.
- alpha : Union[float, np.ndarray], default: 0.05
Share of data to remove from the head and tail.
- tail : str, default: "both"
Part of the distribution to be removed. Can be "left", "right" or "both".
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
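The idea of tail trimming can be sketched in a few lines of plain Python. This is an illustration only: `trim_tails` is a hypothetical function, and the choice to split `alpha` evenly between the two tails when `tail="both"` is an assumption of this sketch, not something the description above confirms.

```python
def trim_tails(values, alpha=0.05, tail="both"):
    """Drop roughly a share `alpha` of observations from the chosen tail(s)."""
    if tail not in ("left", "right", "both"):
        raise ValueError("tail must be 'left', 'right' or 'both'")
    values = list(values)
    n = len(values)
    if tail == "both":
        # Assumption of this sketch: alpha is split evenly between the tails.
        low_cut = high_cut = int(n * alpha / 2)
    elif tail == "left":
        low_cut, high_cut = int(n * alpha), 0
    else:
        low_cut, high_cut = 0, int(n * alpha)

    # Indices of the observations ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    dropped = set(order[:low_cut]) | set(order[n - high_cut:])
    # Keep the original row order for the observations that survive.
    return [v for i, v in enumerate(values) if i not in dropped]


sample = [5, 1, 9, 3, 7, 2, 8, 4, 6, 100]
trimmed = trim_tails(sample, alpha=0.2, tail="both")
```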
- iqr(column_names=None, load_path=None)
Make an IQR preprocessing of the selected columns to remove outliers.
Removes rows from the dataframe that lie beyond the boxplot maximum and minimum of the selected metrics' distributions.
- Parameters:
- column_names : ColumnNamesType, optional
One or more columns in the dataframe.
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
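A boxplot-style filter can be sketched with the standard library alone. The version below uses the conventional Tukey fences at 1.5 times the interquartile range; that factor, the quantile method, and the name `iqr_filter` are assumptions of this sketch and may differ from what the library does.

```python
import statistics


def iqr_filter(values, whisker=1.5):
    """Keep values inside [Q1 - whisker * IQR, Q3 + whisker * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker * iqr  # boxplot minimum
    upper = q3 + whisker * iqr  # boxplot maximum
    return [v for v in values if lower <= v <= upper]


filtered = iqr_filter([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```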
- boxcox(column_names=None, load_path=None)
Make a Box-Cox transformation on the selected columns.
Optimal transformation parameters are selected automatically.
- Parameters:
- column_names : ColumnNamesType, optional
One or more columns in the dataframe.
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
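For reference, the Box-Cox transformation itself is simple once the parameter is known. The sketch below applies it for a fixed, user-supplied `lmbda`; it does not attempt the automatic parameter selection described above, and `boxcox_transform` is a hypothetical name used only here.

```python
import math


def boxcox_transform(values, lmbda):
    """Apply Box-Cox with a fixed lambda: (x**lmbda - 1) / lmbda, or log(x) if lmbda == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]


transformed = boxcox_transform([1, 4, 9], lmbda=0.5)
```

With `lmbda=0` the transformation reduces to the natural logarithm, which is why the log transformation is often treated as a special case of Box-Cox.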
- log(column_names=None, load_path=None)
Make a logarithmic transformation on the selected columns.
- Parameters:
- column_names : ColumnNamesType, optional
One or more columns in the dataframe.
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
- cuped(target=None, by=None, transformed_name=None, load_path=None)
Make CUPED transformation on the selected column.
- Parameters:
- target : ColumnNameType
Column from the dataframe to which the CUPED transformation will be applied.
- by : ColumnNameType
Covariate column in the dataframe.
- transformed_name : types.ColumnNameType, optional
Name for the new transformed target column; if not defined, it will be generated automatically.
- load_path : Path, optional
Path to json file with parameters.
- Returns:
- self : Preprocessor
Instance object
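The standard single-covariate CUPED adjustment can be sketched as follows: the target is shifted by theta times the centered covariate, where theta is the covariance of the two columns divided by the variance of the covariate. This is the textbook form of the technique, written with plain lists; `cuped_adjust` is a hypothetical helper and the library's own implementation may differ in details.

```python
def cuped_adjust(target, covariate):
    """Return target - theta * (covariate - mean(covariate)), theta = cov(x, y) / var(x)."""
    n = len(target)
    mean_y = sum(target) / n
    mean_x = sum(covariate) / n
    # The 1/n factors cancel in the ratio, so raw sums are enough.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, target))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, target)]


target = [3.0, 5.0, 6.0, 10.0]
covariate = [1.0, 2.0, 3.0, 4.0]
adjusted = cuped_adjust(target, covariate)
```

The adjusted column keeps the same mean as the original target while its spread shrinks, which is the variance reduction the method is used for.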
- transformations()
List of all transformations which were called.
- Returns:
- transformers : List[object]
List of executed transformations.
- store_transformations(store_path)
Store transformations with their parameters in a json file.
- Parameters:
- store_path : Path
Path to a json file where transformations will be stored.
- load_transformations(load_path)
Load pre-saved transformations from a json file.
- Parameters:
- load_path : Path
Path to a json file where transformations are stored.
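The store and load steps amount to a JSON round trip of the fitted parameters, so that the same sequence of transformations can be re-applied later. The sketch below shows that idea with the standard library only. The record layout, the step names, and the helpers `store_steps` and `load_steps` are all hypothetical; the actual file format written by the library is not described here and may look different.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record of fitted steps; the real file layout may differ.
steps = [
    {"name": "trim_tails", "params": {"alpha": 0.05, "tail": "both"}},
    {"name": "log", "params": {"columns": ["revenue"]}},
]


def store_steps(steps, store_path):
    """Write the list of steps to a JSON file."""
    Path(store_path).write_text(json.dumps(steps, indent=2))


def load_steps(load_path):
    """Read the list of steps back from a JSON file."""
    return json.loads(Path(load_path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "steps.json"
    store_steps(steps, path)
    restored = load_steps(path)
```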