12  Preprocessing

Pre-processing, or feature engineering, is a crucial step in every machine learning project. Some models require dummy variables instead of categorical ones, normalized numerical variables, or NA values to be imputed.

In tidymodels, every data pre-processing operation is specified through a step_ function from the recipes package. You start by defining a recipe object, in which you specify the predictors, the response variable, and the data. You then incrementally add step_ functions, each of which transforms your data accordingly.

In Scikit-Learn, every pre-processor is a class that learns its parameters from the data with fit() and applies the transformation with transform(); individual pre-processors are typically combined with ColumnTransformer or chained in a Pipeline.
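A minimal sketch of that shared fit/transform interface (the toy data here is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Every scikit-learn pre-processor exposes the same interface:
# fit() learns parameters from the data, transform() applies them,
# and fit_transform() does both in one call.
X = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns mean_ and scale_
X_scaled = scaler.transform(X)  # applies (x - mean) / std

print(scaler.mean_)  # column means learned by fit()
```

The same fitted object can later transform new data (e.g. a test set) using the parameters learned from the training data.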

12.1 Feature engineering

preprocessor <- recipe(heart_disease ~ ., data = train_data) |> 
    step_normalize(all_numeric_predictors()) |> 
    step_dummy(all_nominal_predictors(), one_hot = FALSE)
preprocessor |> 
    prep() |> 
    bake(new_data = NULL) |> 
    head()
# A tibble: 6 × 38
      bmi physical_health mental_health sleep_time heart_disease smoking_Yes
    <dbl>           <dbl>         <dbl>      <dbl> <fct>               <dbl>
1 -1.84           -0.0470         3.28     -1.46   No                      1
2 -1.26           -0.424         -0.490    -0.0681 No                      0
3 -0.275           2.09           3.28      0.626  No                      1
4 -0.648          -0.424         -0.490    -0.762  No                      0
5  0.0846          0.330         -0.490     3.40   Yes                     1
6 -1.05            1.46          -0.490    -2.15   No                      0
# … with 32 more variables: alcohol_drinking_Yes <dbl>, stroke_Yes <dbl>,
#   diff_walking_Yes <dbl>, sex_Male <dbl>, age_category_X25.29 <dbl>,
#   age_category_X30.34 <dbl>, age_category_X35.39 <dbl>,
#   age_category_X40.44 <dbl>, age_category_X45.49 <dbl>,
#   age_category_X50.54 <dbl>, age_category_X55.59 <dbl>,
#   age_category_X60.64 <dbl>, age_category_X65.69 <dbl>,
#   age_category_X70.74 <dbl>, age_category_X75.79 <dbl>, …
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, ColumnTransformer

select_cat_cols = make_column_selector(dtype_include=object)
select_num_cols = make_column_selector(dtype_exclude=object)

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop="first", sparse_output=False), select_cat_cols(train_data)),
    ('num', StandardScaler(), select_num_cols(train_data))
])
pd.DataFrame(
    preprocessor.fit_transform(train_data),
    columns=preprocessor.get_feature_names_out()
).head()
   cat__smoking_Yes  ...  num__sleep_time
0               1.0  ...        -0.065988
1               1.0  ...        -0.763366
2               0.0  ...         0.631390
3               0.0  ...         0.631390
4               1.0  ...        -0.065988

[5 rows x 37 columns]

12.2 Preprocessors

You can find all the available recipe steps in the recipes package reference. For Scikit-Learn, see the official API documentation, chiefly the sklearn.preprocessing module.

12.2.1 Normalization

z-score

step_normalize()
step_center()
step_scale()
from sklearn.preprocessing import StandardScaler

StandardScaler()
StandardScaler(with_std=False)
StandardScaler(with_mean=False)

min-max

step_range()
step_range(min = -1, max = 1)
from sklearn.preprocessing import MinMaxScaler

MinMaxScaler()
MinMaxScaler(feature_range=(-1, 1))
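A quick check of what min-max scaling does (toy data, illustrative only): the column minimum maps to the lower bound of feature_range and the maximum to the upper bound.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [15.0], [20.0]])

# Default feature_range=(0, 1): min maps to 0, max maps to 1.
X_01 = MinMaxScaler().fit_transform(X)

# feature_range=(-1, 1) mirrors step_range(min = -1, max = 1).
X_11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_01.ravel())  # scaled to [0, 1]: 0, 0.5, 1
print(X_11.ravel())  # scaled to [-1, 1]: -1, 0, 1
```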

Gaussian-like

step_YeoJohnson()
step_BoxCox()
from sklearn.preprocessing import PowerTransformer

PowerTransformer(method="yeo-johnson")
PowerTransformer(method="box-cox")
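A small demonstration of the two power transforms (toy data, illustrative only). Note that Box-Cox requires strictly positive input, while Yeo-Johnson also accepts zeros and negative values; both standardize their output by default.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed, strictly positive data.
X = np.array([[1.0], [2.0], [4.0], [8.0], [16.0], [64.0]])

yj = PowerTransformer(method="yeo-johnson").fit_transform(X)
bc = PowerTransformer(method="box-cox").fit_transform(X)

# standardize=True (the default) centers and scales the result.
print(yj.mean(), bc.mean())  # both approximately 0
```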

12.2.2 Binning

step_discretize(num_breaks = 4)
from sklearn.preprocessing import KBinsDiscretizer

KBinsDiscretizer(n_bins=4, encode="ordinal")
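A small example of binning in action (toy data, illustrative only): with encode="ordinal" each value is replaced by its bin index, and the default strategy="quantile" gives equal-frequency bins, similar to step_discretize().

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.arange(12, dtype=float).reshape(-1, 1)  # values 0..11

# encode="ordinal" returns bin indices instead of one-hot columns.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

print(X_binned.ravel())  # bin indices between 0 and 3
```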

12.2.3 Dummy variables

step_dummy(one_hot = TRUE)
step_dummy()
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder()
OneHotEncoder(drop="first")

12.2.4 Imputation

step_impute_mean()
step_impute_median()
step_impute_mode()

step_impute_knn(neighbors = 5)
from sklearn.impute import SimpleImputer, KNNImputer

SimpleImputer(strategy="mean")
SimpleImputer(strategy="median")
SimpleImputer(strategy="most_frequent")

KNNImputer(n_neighbors=5)
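A small imputation example (toy data, illustrative only): SimpleImputer fills each NaN with a per-column statistic, while KNNImputer averages the feature over the nearest complete rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 8.0]])

# Mean imputation: NaN in column 0 becomes (1 + 3 + 5) / 3 = 3.0.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: each NaN is the average over the 2 nearest rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_mean[1, 0])  # 3.0
```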

12.2.5 Augmentation

step_poly(degree = 2)
step_bs(deg_free = 8, degree = 3)
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

PolynomialFeatures(degree=2)
SplineTransformer(n_knots=5, degree=3, knots="quantile")

12.2.6 Apply functions

step_log()
step_log(base = 10)

step_hyperbolic(func = "sinh")
step_hyperbolic(func = "cosh")
step_hyperbolic(func = "tanh")

step_sqrt()
import numpy as np
from sklearn.preprocessing import FunctionTransformer

FunctionTransformer(func=np.log)
FunctionTransformer(func=np.log10)

FunctionTransformer(func=np.sinh)
FunctionTransformer(func=np.cosh)
FunctionTransformer(func=np.tanh)

FunctionTransformer(func=np.sqrt)
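A small FunctionTransformer example (toy data, illustrative only): it wraps any element-wise function, and an inverse_func makes the transformation reversible inside a pipeline.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [10.0], [100.0]])

# log10 forward, 10**x backward.
log10 = FunctionTransformer(func=np.log10,
                            inverse_func=lambda x: 10 ** x)
X_log = log10.fit_transform(X)

print(X_log.ravel())  # 0, 1, 2
```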

12.2.7 Zero-variance

step_zv()
from sklearn.feature_selection import VarianceThreshold

VarianceThreshold(threshold=0)
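A quick demonstration (toy data, illustrative only): with threshold=0, VarianceThreshold drops exactly the constant columns, matching step_zv().

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The second column is constant, so it has zero variance.
X = np.array([[1.0, 7.0],
              [2.0, 7.0],
              [3.0, 7.0]])

selector = VarianceThreshold(threshold=0)
X_sel = selector.fit_transform(X)

print(selector.get_support())  # which columns were kept
print(X_sel.shape)             # (3, 1)
```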