Outlier removal

RobustPreprocessor

Unit for simple robust transformation for avoiding outliers in data.

IQRPreprocessor

Unit for IQR transformation of the data to exclude outliers.

class ambrosia.preprocessing.RobustPreprocessor(verbose=True)[source]

Unit for simple robust transformation for avoiding outliers in data.

It cuts off an alpha share of the distribution from the head, the tail, or both sides for each given metric. The data is assumed to consist of a small alpha part of outliers at the start of the distribution, followed by the normal part of the data, with another alpha part of outliers at the end of the distribution.
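The trimming rule described above can be illustrated with a minimal, library-independent sketch. It is not ambrosia's implementation: the helpers `quantile`, `trim_bounds` and `trim` are hypothetical names, the linear-interpolation quantile and the inclusive comparisons are choices made for the illustration, and splitting alpha evenly between the two sides for tail="both" is an assumption.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, 0 <= q <= 1."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + (data[upper] - data[lower]) * frac


def trim_bounds(values, alpha=0.05, tail="both"):
    """Return (low, high) bounds; None means that side is not trimmed.

    Assumption: for tail="both" the alpha share is split evenly
    between the head and the tail of the distribution.
    """
    if tail == "left":
        return quantile(values, alpha), None
    if tail == "right":
        return None, quantile(values, 1 - alpha)
    if tail == "both":
        return quantile(values, alpha / 2), quantile(values, 1 - alpha / 2)
    raise ValueError(f"unknown tail: {tail!r}")


def trim(values, alpha=0.05, tail="both"):
    """Keep only the values inside the bounds computed from the same data."""
    low, high = trim_bounds(values, alpha, tail)
    return [
        v for v in values
        if (low is None or v >= low) and (high is None or v <= high)
    ]


data = list(range(1, 101))  # the integers 1..100
kept_both = trim(data, alpha=0.1, tail="both")
kept_left = trim(data, alpha=0.1, tail="left")
kept_right = trim(data, alpha=0.1, tail="right")
```

In this sketch the bounds are computed once from the data and then used as a filter, which mirrors the separation between fit (compute the quantiles) and transform (drop the objects outside them).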

Parameters:
verbose: bool, default: True

If True, shows information about the transformation of the passed columns.

Attributes:
params: Dict

Dictionary with operational parameters of the instance. Updated after calling the fit method.

verbose: bool

Verbose info flag.

available_tails: List

List of the available tail type names to preprocess.

non_serializable_params: List

List of the class parameters that should be converted to lists in order to serialize.

fitted: bool

Fit flag.

Examples

>>> robust = RobustPreprocessor(verbose=True)
>>> robust.fit(dataframe, ['column1', 'column2'], alpha=0.05)
>>> robust.transform(dataframe, inplace=True)

You can pass one or several columns. If several columns are passed, a total share of alpha of the extreme values is dropped for each column.

fit(dataframe, column_names, alpha=0.05, tail='both')[source]

Fit to calculate robust parameters for the selected columns.

Parameters:
dataframe: pd.DataFrame

Dataframe used to calculate quantiles.

column_names: ColumnNamesType

One or several columns of the dataframe.

alpha: Union[float, np.ndarray], default: 0.05

The share of data to remove from the head, the tail, or both, depending on tail.

tail: str, default: "both"

Part of the distribution to be removed. Can be "left", "right" or "both".

Returns:
self: object

Instance object.

transform(dataframe, inplace=False)[source]

Remove objects from the dataframe that fall in the head, tail, or both alpha parts of the chosen metrics' distributions.

Parameters:
dataframe: pd.DataFrame

Dataframe to transform.

inplace: bool, default: False

If True, transforms the given dataframe in place; otherwise, copies it and returns the transformed copy.

Returns:
df: Union[pd.DataFrame, None]

Transformed dataframe or None.

fit_transform(dataframe, column_names, alpha=0.05, tail='both', inplace=False)[source]

Fit preprocessor parameters using given dataframe and transform it.

Parameters:
dataframe: pd.DataFrame

Dataframe used to calculate quantiles and then transformed.

column_names: ColumnNamesType

One or several columns of the dataframe.

alpha: Union[float, np.ndarray], default: 0.05

The share of data to remove from the head, the tail, or both, depending on tail.

tail: str, default: "both"

Part of the distribution to be removed. Can be "left", "right" or "both".

inplace: bool, default: False

If True, transforms the given dataframe in place; otherwise, copies it and returns the transformed copy.

Returns:
df: Union[pd.DataFrame, None]

Transformed dataframe or None.

store_params(store_path)
Parameters:
store_path: Path

Path where parameters will be stored in JSON format.

load_params(load_path)
Parameters:
load_path: Path

Path to the JSON file with parameters.
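The role of non_serializable_params in storing and loading can be sketched with the standard library alone. This is an illustration, not the library's code: the keys of the `params` dictionary, the helper names `store_params_sketch` and `load_params_sketch`, and the file name are hypothetical, and `array.array` is used only as a stand-in for a np.ndarray, which the json module also cannot serialize directly.

```python
import json
import tempfile
from array import array
from pathlib import Path

# Hypothetical fitted parameters; the real keys inside `params` may differ.
params = {
    "column_names": ["column1", "column2"],
    "alpha": array("d", [0.05, 0.1]),  # stand-in for a np.ndarray
    "tail": "both",
}
# Names of the fields that must become plain lists before serialization.
non_serializable = ["alpha"]


def store_params_sketch(params, store_path, non_serializable):
    """Convert array-like fields to lists and write the result as JSON."""
    serializable = dict(params)
    for key in non_serializable:
        serializable[key] = list(serializable[key])
    Path(store_path).write_text(json.dumps(serializable))


def load_params_sketch(load_path):
    """Read the parameters back from a JSON file."""
    return json.loads(Path(load_path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "robust_params.json"
    store_params_sketch(params, path, non_serializable)
    restored = load_params_sketch(path)
```

After the round trip the array-like field comes back as a plain list, so code that expects an array would have to convert it again on load.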

class ambrosia.preprocessing.IQRPreprocessor(verbose=True)[source]

Unit for IQR transformation of the data to exclude outliers.

It removes the points of the distribution that lie outside the range from (0.25 quantile - 1.5 * IQR) to (0.75 quantile + 1.5 * IQR) for each given metric.
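The range above can be worked through on a small sample. The following is a minimal sketch rather than the library's implementation: `iqr_bounds` and `drop_iqr_outliers` are hypothetical helpers, and the quartiles are computed with `statistics.quantiles` using the "inclusive" method, which may differ from the interpolation ambrosia uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) = (Q1 - k * IQR, Q3 + k * IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(values, k=1.5):
    """Keep only the values inside the IQR-based range."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
low, high = iqr_bounds(data)       # Q1 = 12.0, Q3 = 13.75, IQR = 1.75
cleaned = drop_iqr_outliers(data)  # 102 lies above `high` and is removed
```

Unlike the alpha-based trimming, no share of the data is fixed in advance: a sample without extreme points passes through unchanged.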

Parameters:
verbose: bool, default: True

If True, shows information about the transformation of the passed columns.

Attributes:
params: Dict

Dictionary with operational parameters of the instance. Updated after calling the fit method.

verbose: bool

Verbose info flag.

non_serializable_params: List

List of the class parameters that should be converted to lists in order to serialize.

fitted: bool

Fit flag.

Examples

>>> iqr = IQRPreprocessor(verbose=True)
>>> iqr.fit(dataframe, ['column1', 'column2'])
>>> iqr.transform(dataframe, inplace=True)

You can pass one or several columns. If several columns are passed, extreme values are dropped for each column.
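How several columns might interact can be sketched on plain rows. The sketch assumes that an object is removed as soon as its value is extreme in any of the fitted columns, which the text above does not state explicitly; `fit_iqr` and `transform_iqr` are hypothetical helpers working on a list of dictionaries, not on a pd.DataFrame.

```python
from statistics import quantiles


def fit_iqr(rows, column_names, k=1.5):
    """Compute per-column (low, high) bounds from a list of row dicts."""
    bounds = {}
    for name in column_names:
        column = [row[name] for row in rows]
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        bounds[name] = (q1 - k * iqr, q3 + k * iqr)
    return bounds


def transform_iqr(rows, bounds):
    """Keep rows whose values are inside the bounds of every fitted column."""
    return [
        row for row in rows
        if all(low <= row[name] <= high for name, (low, high) in bounds.items())
    ]


rows = [
    {"column1": 10, "column2": 1.0},
    {"column1": 11, "column2": 1.1},
    {"column1": 12, "column2": 0.9},
    {"column1": 13, "column2": 1.2},
    {"column1": 12, "column2": 50.0},  # extreme in column2 only
    {"column1": 500, "column2": 1.0},  # extreme in column1 only
    {"column1": 11, "column2": 1.1},
    {"column1": 12, "column2": 1.0},
]
bounds = fit_iqr(rows, ["column1", "column2"])
kept = transform_iqr(rows, bounds)
```

Because the bounds are stored separately from the data, the same fitted bounds could also be applied to a different set of rows.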

fit(dataframe, column_names)[source]

Fit to calculate IQR parameters for the selected columns.

Parameters:
dataframe: pd.DataFrame

Dataframe used to calculate quantiles.

column_names: ColumnNamesType

One or several columns of the dataframe.

Returns:
self: object

Instance object.

transform(dataframe, inplace=False)[source]

Remove objects from the dataframe that lie outside the boxplot minimum and maximum values of each metric's distribution.

Parameters:
dataframe: pd.DataFrame

Dataframe to transform.

inplace: bool, default: False

If True, transforms the given dataframe in place; otherwise, copies it and returns the transformed copy.

Returns:
df: Union[pd.DataFrame, None]

Transformed dataframe or None.

fit_transform(dataframe, column_names, inplace=False)[source]

Fit preprocessor parameters using given dataframe and transform it.

Parameters:
dataframe: pd.DataFrame

Dataframe used to calculate quantiles and then transformed.

column_names: ColumnNamesType

One or several columns of the dataframe.

inplace: bool, default: False

If True, transforms the given dataframe in place; otherwise, copies it and returns the transformed copy.

Returns:
df: Union[pd.DataFrame, None]

Transformed dataframe or None.

store_params(store_path)
Parameters:
store_path: Path

Path where parameters will be stored in JSON format.

load_params(load_path)
Parameters:
load_path: Path

Path to the JSON file with parameters.