12  Preprocessing

Pre-processing, or feature engineering, is a crucial step in every machine learning project. Some models require dummy variables instead of categorical ones, normalized numerical variables, or NA values to be imputed.

In tidymodels, every data pre-processing operation is specified through a step_ function from the recipes package. You start by defining a recipe object, in which you specify the predictors, the response variable, and the data. You then incrementally add step_ functions, each of which transforms your data accordingly.

In Scikit-Learn, every pre-processor is a class that learns its parameters from the data with fit() and applies the transformation with transform(); individual pre-processors are typically combined with ColumnTransformer or chained in a Pipeline.
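A minimal sketch of that shared fit/transform interface (the toy data here is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Every scikit-learn pre-processor exposes the same interface:
# fit() learns parameters from the data, transform() applies them,
# and fit_transform() does both in one call.
X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns mean_ and scale_
X_scaled = scaler.transform(X)  # applies (x - mean) / std

print(scaler.mean_)  # column means learned by fit()
```

The same fitted object can later transform new data (e.g. a test set) using the parameters learned from the training data.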

12.1 Feature engineering

preprocessor <- recipe(heart_disease ~ ., data = train_data) |> 
    step_normalize(all_numeric_predictors()) |> 
    step_dummy(all_nominal_predictors(), one_hot = FALSE)
preprocessor |> 
    prep() |> 
    bake(new_data = NULL) |> 
    head()
# A tibble: 6 × 38
      bmi physical_health mental_health sleep_time heart_disease smoking_Yes
    <dbl>           <dbl>         <dbl>      <dbl> <fct>               <dbl>
1 -1.84           -0.0470         3.28     -1.46   No                      1
2 -1.26           -0.424         -0.490    -0.0681 No                      0
3 -0.275           2.09           3.28      0.626  No                      1
4 -0.648          -0.424         -0.490    -0.762  No                      0
5  0.0846          0.330         -0.490     3.40   Yes                     1
6 -1.05            1.46          -0.490    -2.15   No                      0
# … with 32 more variables: alcohol_drinking_Yes <dbl>, stroke_Yes <dbl>,
#   diff_walking_Yes <dbl>, sex_Male <dbl>, age_category_X25.29 <dbl>,
#   age_category_X30.34 <dbl>, age_category_X35.39 <dbl>,
#   age_category_X40.44 <dbl>, age_category_X45.49 <dbl>,
#   age_category_X50.54 <dbl>, age_category_X55.59 <dbl>,
#   age_category_X60.64 <dbl>, age_category_X65.69 <dbl>,
#   age_category_X70.74 <dbl>, age_category_X75.79 <dbl>, …
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, ColumnTransformer

select_cat_cols = make_column_selector(dtype_include=object)
select_num_cols = make_column_selector(dtype_exclude=object)

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop="first", sparse_output=False), select_cat_cols(train_data)),
    ('num', StandardScaler(), select_num_cols(train_data))
])
pd.DataFrame(
    preprocessor.fit_transform(train_data),
    columns=preprocessor.get_feature_names_out()
).head()
   cat__smoking_Yes  ...  num__sleep_time
0               1.0  ...        -0.065988
1               1.0  ...        -0.763366
2               0.0  ...         0.631390
3               0.0  ...         0.631390
4               1.0  ...        -0.065988

[5 rows x 37 columns]

12.2 Preprocessors

You can find all the available recipe steps in the recipes package reference. For Scikit-Learn, see the official API documentation, chiefly the sklearn.preprocessing module.

12.2.1 Normalization

z-score

step_normalize()
step_center()
step_scale()
from sklearn.preprocessing import StandardScaler

StandardScaler()
StandardScaler(with_std=False)
StandardScaler(with_mean=False)

min-max

step_range()
step_range(min = -1, max = 1)
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler()
MinMaxScaler(feature_range=(-1, 1))
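A quick check of what min-max scaling does (toy data, illustrative only): the column minimum maps to the lower bound of feature_range and the maximum to the upper bound.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [15.0], [20.0]])

# Default feature_range=(0, 1): min maps to 0, max maps to 1.
X_01 = MinMaxScaler().fit_transform(X)

# feature_range=(-1, 1) mirrors step_range(min = -1, max = 1).
X_11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_01.ravel())  # scaled to [0, 1]: 0, 0.5, 1
print(X_11.ravel())  # scaled to [-1, 1]: -1, 0, 1
```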

Gaussian-like

step_YeoJohnson()
step_BoxCox()
from sklearn.preprocessing import PowerTransformer

PowerTransformer(method="yeo-johnson")
PowerTransformer(method="box-cox")
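A small demonstration of the two power transforms (toy data, illustrative only). Note that Box-Cox requires strictly positive input, while Yeo-Johnson also accepts zeros and negative values; both standardize their output by default.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed, strictly positive data.
X = np.array([[1.0], [2.0], [4.0], [8.0], [16.0], [64.0]])

yj = PowerTransformer(method="yeo-johnson").fit_transform(X)
bc = PowerTransformer(method="box-cox").fit_transform(X)

# standardize=True (the default) centers and scales the result.
print(yj.mean(), bc.mean())  # both approximately 0
```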

12.2.2 Binning

step_discretize(num_breaks = 4)
from sklearn.preprocessing import KBinsDiscretizer

KBinsDiscretizer(n_bins=4, encode="ordinal")
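A small example of binning in action (toy data, illustrative only): with encode="ordinal" each value is replaced by its bin index, and the default strategy="quantile" gives equal-frequency bins, similar to step_discretize().

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.arange(12, dtype=float).reshape(-1, 1)  # values 0..11

# encode="ordinal" returns bin indices instead of one-hot columns.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

print(X_binned.ravel())  # bin indices between 0 and 3
```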

12.2.3 Dummy variables

step_dummy(one_hot = TRUE)
step_dummy()
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder()
OneHotEncoder(drop="first")

12.2.4 Imputation

step_impute_mean()
step_impute_median()
step_impute_mode()

step_impute_knn(neighbors = 5)
from sklearn.impute import SimpleImputer, KNNImputer

SimpleImputer(strategy="mean")
SimpleImputer(strategy="median")
SimpleImputer(strategy="most_frequent")

KNNImputer(n_neighbors=5)
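A small imputation example (toy data, illustrative only): SimpleImputer fills each NaN with a per-column statistic, while KNNImputer averages the feature over the nearest complete rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 8.0]])

# Mean imputation: NaN in column 0 becomes (1 + 3 + 5) / 3 = 3.0.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: each NaN is the average over the 2 nearest rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean[1, 0])  # 3.0
```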

12.2.5 Augmentation

step_poly(degree = 2)
step_bs(deg_free = 8, degree = 3)
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

PolynomialFeatures(degree=2)
SplineTransformer(n_knots=5, degree=3, knots="quantile")

12.2.6 Apply functions

step_log()
step_log(base = 10)

step_hyperbolic(func = "sinh")
step_hyperbolic(func = "cosh")
step_hyperbolic(func = "tanh")

step_sqrt()
import numpy as np
from sklearn.preprocessing import FunctionTransformer

FunctionTransformer(func=np.log)
FunctionTransformer(func=np.log10)

FunctionTransformer(func=np.sinh)
FunctionTransformer(func=np.cosh)
FunctionTransformer(func=np.tanh)

FunctionTransformer(func=np.sqrt)
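A small FunctionTransformer example (toy data, illustrative only): it wraps any element-wise function, and an inverse_func makes the transformation reversible inside a pipeline.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [10.0], [100.0]])

# log10 forward, 10**x backward.
log10 = FunctionTransformer(func=np.log10,
                            inverse_func=lambda x: 10 ** x)
X_log = log10.fit_transform(X)

print(X_log.ravel())  # 0, 1, 2
```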

12.2.7 Zero-variance

step_zv()
from sklearn.feature_selection import VarianceThreshold

VarianceThreshold(threshold=0)
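A quick demonstration (toy data, illustrative only): with threshold=0, VarianceThreshold drops exactly the constant columns, matching step_zv().

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The second column is constant, so it has zero variance.
X = np.array([[1.0, 7.0],
              [2.0, 7.0],
              [3.0, 7.0]])

selector = VarianceThreshold(threshold=0)
X_sel = selector.fit_transform(X)

print(selector.get_support())  # which columns were kept
print(X_sel.shape)             # (3, 1)
```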