.. _outlier_trimmer: .. currentmodule:: feature_engine.outliers OutlierTrimmer ============== The :class:`OutlierTrimmer()` removes values beyond an automatically generated minimum and/or maximum values. The minimum and maximum values can be calculated in 1 of 3 ways: Gaussian limits: - right tail: mean + 3* std - left tail: mean - 3* std IQR limits: - right tail: 75th quantile + 3* IQR - left tail: 25th quantile - 3* IQR where IQR is the inter-quartile range: 75th quantile - 25th quantile. MAD limits: - right tail: median + 3* MAD - left tail: median - 3* MAD where MAD is the median absoulte deviation from the median. percentiles or quantiles: - right tail: 95th percentile - left tail: 5th percentile **Example** Let's remove some outliers in the Titanic Dataset. First, let's load the data and separate it into train and test: .. code:: python from sklearn.model_selection import train_test_split from feature_engine.datasets import load_titanic from feature_engine.outliers import OutlierTrimmer X, y = load_titanic( return_X_y_frame=True, predictors_only=True, handle_missing=True, ) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=0, ) print(X_train.head()) We see the resulting data below: .. code:: python pclass sex age sibsp parch fare cabin embarked 501 2 female 13.000000 0 1 19.5000 Missing S 588 2 female 4.000000 1 1 23.0000 Missing S 402 2 female 30.000000 1 0 13.8583 Missing C 1193 3 male 29.881135 0 0 7.7250 Missing Q 686 3 female 22.000000 0 0 7.7250 Missing Q Now, we will set the :class:`OutlierTrimmer()` to remove outliers at the right side of the distribution only (param `tail`). We want the maximum values to be determined using the 75th quantile of the variable (param `capping_method`) plus 1.5 times the IQR (param `fold`). And we only want to cap outliers in 2 variables, which we indicate in a list. .. code:: python capper = OutlierTrimmer(capping_method='iqr', tail='right', fold=1.5, variables=['age', 'fare']) capper.fit(X_train) With `fit()`, the :class:`OutlierTrimmer()` finds the values at which it should cap the variables. These values are stored in its attribute: .. code:: python capper.right_tail_caps_ .. code:: python {'age': 53.0, 'fare': 66.34379999999999} We can now go ahead and remove the outliers: .. code:: python train_t = capper.transform(X_train) test_t = capper.transform(X_test) If we evaluate now the maximum of the variables in the transformed datasets, they should be <= the values observed in the attribute `right_tail_caps_`: .. code:: python train_t[['fare', 'age']].max() .. code:: python fare 65.0 age 53.0 dtype: float64 More details ^^^^^^^^^^^^ You can find more details about the :class:`OutlierTrimmer()` functionality in the following notebook: - `Jupyter notebook `_ For more details about this and other feature engineering methods check out these resources: - `Feature engineering for machine learning `_, online course. - `Python Feature Engineering Cookbook `_, book.