12 Preprocessing
Pre-processing, or feature engineering, is a crucial step in every machine learning project. Some models require dummy variables instead of categorical ones, normalized numerical variables, or imputed NA values.
In tidymodels, every data pre-processing operation can be specified through a step_ function from the recipes package. You start by defining a recipe object in which you specify the predictors and the response variable as well as the data. Then you incrementally add step_ functions, which transform your data accordingly.
In Scikit-Learn, every pre-processor is a class object …
12.1 Feature engineering
preprocessor <- recipe(heart_disease ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

preprocessor |>
  prep() |>
  bake(new_data = NULL) |>
  head()

# A tibble: 6 × 38
bmi physical_health mental_health sleep_time heart_disease smoking_Yes
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 -1.84 -0.0470 3.28 -1.46 No 1
2 -1.26 -0.424 -0.490 -0.0681 No 0
3 -0.275 2.09 3.28 0.626 No 1
4 -0.648 -0.424 -0.490 -0.762 No 0
5 0.0846 0.330 -0.490 3.40 Yes 1
6 -1.05 1.46 -0.490 -2.15 No 0
# … with 32 more variables: alcohol_drinking_Yes <dbl>, stroke_Yes <dbl>,
# diff_walking_Yes <dbl>, sex_Male <dbl>, age_category_X25.29 <dbl>,
# age_category_X30.34 <dbl>, age_category_X35.39 <dbl>,
# age_category_X40.44 <dbl>, age_category_X45.49 <dbl>,
# age_category_X50.54 <dbl>, age_category_X55.59 <dbl>,
# age_category_X60.64 <dbl>, age_category_X65.69 <dbl>,
# age_category_X70.74 <dbl>, age_category_X75.79 <dbl>, …
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, ColumnTransformer
select_cat_cols = make_column_selector(dtype_include=object)
select_num_cols = make_column_selector(dtype_exclude=object)
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop="first", sparse=False), select_cat_cols(train_data)),
    ('num', StandardScaler(), select_num_cols(train_data))
])

pd.DataFrame(
    preprocessor.fit_transform(train_data),
    columns=preprocessor.get_feature_names_out(preprocessor.feature_names_in_)
).head()

   cat__smoking_Yes  ...  num__sleep_time
0 1.0 ... -0.065988
1 1.0 ... -0.763366
2 0.0 ... 0.631390
3 0.0 ... 0.631390
4 1.0 ... -0.065988
[5 rows x 37 columns]
12.2 Preprocessors
You can find all the available steps in the recipes package reference. For Scikit-Learn, you can see the official API pages.
12.2.1 Normalization
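Before listing the equivalents, here is a quick, self-contained sketch of the two most common scalers on a tiny toy array (the data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# z-score: subtract the mean, divide by the standard deviation.
z = StandardScaler().fit_transform(x)

# min-max: rescale linearly into [0, 1] (or any feature_range).
m = MinMaxScaler().fit_transform(x)

print(z.ravel())  # mean 0, unit variance
print(m.ravel())  # spans [0, 1]
```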
z-score
step_normalize()
step_center()
step_scale()

from sklearn.preprocessing import StandardScaler
StandardScaler()
StandardScaler(with_std=False)
StandardScaler(with_mean=False)

min-max

step_range()
step_range(min = -1, max = 1)

from sklearn.preprocessing import MinMaxScaler
MinMaxScaler()
MinMaxScaler(feature_range=(-1, 1))

Gaussian-like

step_YeoJohnson()
step_BoxCox()

from sklearn.preprocessing import PowerTransformer
PowerTransformer(method="yeo-johnson")
PowerTransformer(method="box-cox")

12.2.2 Binning
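As a quick illustration of the Scikit-Learn side, here is a self-contained sketch that bins a toy column into four equal-width intervals (the data and bin count are made up for illustration; the chapter's listing uses the defaults):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [2.0], [3.0], [10.0], [20.0], [30.0]])

# Four equal-width bins over [1, 30], encoded as ordinal codes 0..3.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
codes = binner.fit_transform(x)
print(codes.ravel())  # [0. 0. 0. 1. 2. 3.]
```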
step_discretize(num_breaks = 4)

from sklearn.preprocessing import KBinsDiscretizer
KBinsDiscretizer(n_bins=5, encode="ordinal")

12.2.3 Dummy variables
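A self-contained sketch of both encodings on a toy categorical column (the data is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One column per level, like step_dummy(one_hot = TRUE).
full = OneHotEncoder().fit_transform(colors).toarray()

# Drop the first (alphabetically) level to avoid collinearity,
# like step_dummy()'s default treatment contrasts.
dropped = OneHotEncoder(drop="first").fit_transform(colors).toarray()

print(full.shape, dropped.shape)  # (4, 3) (4, 2)
```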
step_dummy(one_hot = TRUE)
step_dummy()

from sklearn.preprocessing import OneHotEncoder
OneHotEncoder()
OneHotEncoder(drop="first")

12.2.4 Imputation
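A self-contained sketch contrasting mean and KNN imputation on a toy matrix with one missing value (the data and neighbor count are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

x = np.array([[1.0, 1.0],
              [2.0, np.nan],
              [3.0, 3.0],
              [4.0, 4.0]])

# Mean imputation fills the NA with the column mean, (1 + 3 + 4) / 3.
mean_filled = SimpleImputer(strategy="mean").fit_transform(x)

# KNN imputation averages the feature over the 2 nearest complete rows
# (here rows [1, 1] and [3, 3], giving 2.0).
knn_filled = KNNImputer(n_neighbors=2).fit_transform(x)

print(mean_filled[1, 1], knn_filled[1, 1])
```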
step_impute_mean()
step_impute_median()
step_impute_mode()
step_impute_knn(neighbors = 5)

from sklearn.impute import SimpleImputer, KNNImputer
SimpleImputer(strategy="mean")
SimpleImputer(strategy="median")
SimpleImputer(strategy="most_frequent")
KNNImputer(n_neighbors=5)

12.2.5 Augmentation
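A self-contained sketch of a degree-2 polynomial expansion on toy data (the data is made up for illustration). One caveat worth keeping in mind: step_poly() defaults to an orthogonal polynomial basis, whereas PolynomialFeatures returns raw powers plus an intercept column, so the two are equivalent fits but not identical features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])

# degree-2 expansion: intercept, x, x^2.
expanded = PolynomialFeatures(degree=2).fit_transform(x)
print(expanded)  # [[1. 2. 4.], [1. 3. 9.]]
```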
step_poly(degree = 2)
step_bs(deg_free = 8, degree = 3)

from sklearn.preprocessing import PolynomialFeatures, SplineTransformer
PolynomialFeatures(degree=2)
SplineTransformer(n_knots=5, degree=3, knots="quantile")

12.2.6 Apply functions
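A self-contained sketch of wrapping an arbitrary NumPy function with FunctionTransformer, on toy data chosen so the log is easy to check (the data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

x = np.array([[1.0], [np.e], [np.e ** 2]])

# Element-wise natural log, the counterpart of step_log().
logged = FunctionTransformer(func=np.log).fit_transform(x)
print(logged.ravel())  # [0. 1. 2.]
```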
step_log()
step_log(base = 10)
step_hyperbolic(func = "sinh")
step_hyperbolic(func = "cosh")
step_hyperbolic(func = "tanh")
step_sqrt()

import numpy as np
from sklearn.preprocessing import FunctionTransformer
FunctionTransformer(func=np.log)
FunctionTransformer(func=np.log10)
FunctionTransformer(func=np.sinh)
FunctionTransformer(func=np.cosh)
FunctionTransformer(func=np.tanh)
FunctionTransformer(func=np.sqrt)

12.2.7 Zero-variance
step_zv()

from sklearn.feature_selection import VarianceThreshold
VarianceThreshold(threshold=0)
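A self-contained sketch on toy data (made up for illustration) showing the constant column being dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

x = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

# The second column is constant (zero variance) and gets removed,
# mirroring what step_zv() does to zero-variance predictors.
kept = VarianceThreshold(threshold=0).fit_transform(x)
print(kept)  # only the first column survives
```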