12 Preprocessing
Pre-processing, or feature engineering, is a crucial step in every machine learning project. Some models require dummy variables instead of categorical ones, normalized numerical variables, or imputed NA values.
In tidymodels, every data pre-processing operation can be specified through a step_ function of the recipes package. You start by defining a recipe object, in which you specify the predictors, the response variable, and the data. Then you incrementally add step_ functions, which transform your data accordingly.
In Scikit-Learn, every pre-processor is a class object …
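Concretely, every Scikit-Learn pre-processor exposes a fit() method that learns its parameters from the data and a transform() method that applies them; fit_transform() chains the two. A minimal sketch on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two numeric columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learn column means and standard deviations
X_scaled = scaler.transform(X)   # apply the learned transformation

# fit_transform() performs both steps in one call
X_same = StandardScaler().fit_transform(X)
```

This two-step design is what lets you fit a pre-processor on training data and reuse the learned parameters on new data.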
12.1 Feature engineering
preprocessor <- recipe(heart_disease ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

preprocessor |>
  prep() |>
  bake(new_data = NULL) |>
  head()
# A tibble: 6 × 38
bmi physical_health mental_health sleep_time heart_disease smoking_Yes
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 -1.84 -0.0470 3.28 -1.46 No 1
2 -1.26 -0.424 -0.490 -0.0681 No 0
3 -0.275 2.09 3.28 0.626 No 1
4 -0.648 -0.424 -0.490 -0.762 No 0
5 0.0846 0.330 -0.490 3.40 Yes 1
6 -1.05 1.46 -0.490 -2.15 No 0
# … with 32 more variables: alcohol_drinking_Yes <dbl>, stroke_Yes <dbl>,
# diff_walking_Yes <dbl>, sex_Male <dbl>, age_category_X25.29 <dbl>,
# age_category_X30.34 <dbl>, age_category_X35.39 <dbl>,
# age_category_X40.44 <dbl>, age_category_X45.49 <dbl>,
# age_category_X50.54 <dbl>, age_category_X55.59 <dbl>,
# age_category_X60.64 <dbl>, age_category_X65.69 <dbl>,
# age_category_X70.74 <dbl>, age_category_X75.79 <dbl>, …
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, ColumnTransformer

select_cat_cols = make_column_selector(dtype_include=object)
select_num_cols = make_column_selector(dtype_exclude=object)

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop="first", sparse=False), select_cat_cols(train_data)),
    ('num', StandardScaler(), select_num_cols(train_data))
])

pd.DataFrame(
    preprocessor.fit_transform(train_data),
    columns=preprocessor.get_feature_names_out(preprocessor.feature_names_in_)
).head()
cat__smoking_Yes ... num__sleep_time
0 1.0 ... -0.065988
1 1.0 ... -0.763366
2 0.0 ... 0.631390
3 0.0 ... 0.631390
4 1.0 ... -0.065988
[5 rows x 37 columns]
12.2 Preprocessors
You can find all the available steps in the recipes reference documentation. For Scikit-Learn, you can see the official API pages.
12.2.1 Normalization
z-score
step_normalize()
step_center()
step_scale()
from sklearn.preprocessing import StandardScaler
StandardScaler()
StandardScaler(with_std=False)
StandardScaler(with_mean=False)
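As a runnable sketch on made-up values, the full z-score and the centering-only variant behave as expected:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up column

z = StandardScaler().fit_transform(x)                       # zero mean, unit variance
centered = StandardScaler(with_std=False).fit_transform(x)  # zero mean only
```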
min-max
step_range()
step_range(min = -1, max = 1)
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler()
MinMaxScaler(feature_range=(-1, 1))
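On a made-up column, the default maps values to [0, 1], and feature_range gives the equivalent of step_range(min = -1, max = 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [15.0], [20.0]])  # made-up column

m01 = MinMaxScaler().fit_transform(x)                       # maps to [0, 1]
m11 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)  # maps to [-1, 1]
```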
Gaussian-like
step_YeoJohnson()
step_BoxCox()
from sklearn.preprocessing import PowerTransformer
="yeo-johnson")
PowerTransformer(method="box-cox") PowerTransformer(method
12.2.2 Binning
step_discretize(num_breaks = 4)
from sklearn.preprocessing import KBinsDiscretizer
KBinsDiscretizer(n_bins=5, encode="ordinal")
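A runnable sketch on a made-up column; strategy="uniform" is an extra assumption here (the default is quantile-based bins) so that the resulting bins are easy to predict:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.arange(10, dtype=float).reshape(-1, 1)  # made-up column: 0..9

# Equal-width bins, encoded as ordinal integers 0..4
binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          strategy="uniform").fit_transform(x)
```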
12.2.3 Dummy variables
step_dummy(one_hot = TRUE)
step_dummy()
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder()
OneHotEncoder(drop="first")
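On a made-up categorical column, the default produces one column per level, while drop="first" removes the first level to avoid collinearity (the analogue of step_dummy()); .toarray() converts the sparse output to a dense array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([["red"], ["green"], ["blue"], ["green"]])  # made-up column

full = OneHotEncoder().fit_transform(x).toarray()                 # one column per level
dropped = OneHotEncoder(drop="first").fit_transform(x).toarray()  # first level dropped
```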
12.2.4 Imputation
step_impute_mean()
step_impute_median()
step_impute_mode()
step_impute_knn(neighbors = 5)
from sklearn.impute import SimpleImputer, KNNImputer
="mean")
SimpleImputer(strategy="median")
SimpleImputer(strategy="most_frequent")
SimpleImputer(strategy
=5) KNNImputer(n_neighbors
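A sketch on made-up data with one missing value: SimpleImputer fills it with the column mean, while KNNImputer averages the corresponding values of the nearest rows (measured on the observed features):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up data with one missing value in the first column
x = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [np.nan, 3.0],
              [4.0, 4.0]])

mean_imp = SimpleImputer(strategy="mean").fit_transform(x)
knn_imp = KNNImputer(n_neighbors=2).fit_transform(x)  # mean of the 2 nearest rows
```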
12.2.5 Augmentation
step_poly(degree=2)
step_bs(deg_free = 8, degree = 3)
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
PolynomialFeatures(degree=2)
SplineTransformer(n_knots=5, degree=3, knots="quantile")
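On a made-up column, PolynomialFeatures with degree=2 appends the squared term (and, by default, an intercept column of ones):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # made-up column

# Columns: intercept, x, x^2
poly = PolynomialFeatures(degree=2).fit_transform(x)
```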
12.2.6 Apply functions
step_log()
step_log(base = 10)
step_hyperbolic(func = "sin")
step_hyperbolic(func = "cos")
step_hyperbolic(func = "tan")
step_sqrt()
import numpy as np
from sklearn.preprocessing import FunctionTransformer
FunctionTransformer(func=np.log)
FunctionTransformer(func=np.log10)
FunctionTransformer(func=np.sin)
FunctionTransformer(func=np.cos)
FunctionTransformer(func=np.tan)
FunctionTransformer(func=np.sqrt)
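FunctionTransformer applies any element-wise NumPy function; a sketch with np.log on made-up values:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

x = np.array([[1.0], [np.e], [np.e ** 2]])  # made-up column

logged = FunctionTransformer(func=np.log).fit_transform(x)
```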
12.2.7 Zero-variance
step_zv()
from sklearn.feature_selection import VarianceThreshold
VarianceThreshold(threshold=0)
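A quick sketch on made-up data: with threshold=0, only constant (zero-variance) columns are dropped, matching step_zv():

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: the second column is constant and carries no information
x = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

kept = VarianceThreshold(threshold=0).fit_transform(x)  # drops the constant column
```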