hotels <- read_csv("data/hotels.csv") |>
mutate(across(where(is.character), as.factor))
count(hotels, children)
# A tibble: 2 × 2
children n
<fct> <int>
1 children 4039
2 none 45961
Lecture 23
Cornell University
INFO 2950 - Spring 2024
April 18, 2024
ae-21
(repo name will be suffixed with your GitHub name)
recipes
recipe()
Creates a recipe for a set of variables
step_*()
Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.
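As a minimal sketch (assuming the hotels data loaded above), a recipe starts from a formula plus data, and step_*() calls are chained onto it:

```r
library(tidymodels)

# recipe() records outcome/predictor roles from the formula;
# step_*() calls queue transformations. Nothing is computed yet --
# the steps are replayed later when the recipe is trained and applied.
rec <- recipe(children ~ ., data = hotels) |>
  step_date(arrival_date)  # queue a date-feature transformation
```
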
# A tibble: 45,000 × 1
arrival_date
<date>
1 2016-04-28
2 2016-12-29
3 2016-10-17
4 2016-05-22
5 2016-03-02
6 2016-06-16
7 2017-02-13
8 2017-08-20
9 2017-08-22
10 2017-05-18
# ℹ 44,990 more rows
# A tibble: 45,000 × 4
arrival_date arrival_date_dow arrival_date_month arrival_date_year
<date> <fct> <fct> <int>
1 2016-04-28 Thu Apr 2016
2 2016-12-29 Thu Dec 2016
3 2016-10-17 Mon Oct 2016
4 2016-05-22 Sun May 2016
5 2016-03-02 Wed Mar 2016
6 2016-06-16 Thu Jun 2016
7 2017-02-13 Mon Feb 2017
8 2017-08-20 Sun Aug 2017
9 2017-08-22 Tue Aug 2017
10 2017-05-18 Thu May 2017
# ℹ 44,990 more rows
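The day-of-week, month, and year columns shown above come from step_date(). A hedged sketch of how that transformation is produced (prep() and bake() train and apply the recipe):

```r
# step_date() expands a date column into calendar features;
# dow, month, and year are its default feature set.
recipe(children ~ ., data = hotels) |>
  step_date(arrival_date, features = c("dow", "month", "year")) |>
  prep() |>
  bake(new_data = NULL)  # returns the transformed training data
```
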
step_*()
Complete list at: https://recipes.tidymodels.org/reference/index.html
step_holiday() + step_rm()
Generate a set of indicator variables for specific holidays; step_rm() then removes the original date column.
Rows: 45,000
Columns: 11
$ arrival_date_dow <fct> Thu, Thu, Mon, Sun, Wed, Thu, Mon, Sun, Tue,…
$ arrival_date_month <fct> Apr, Dec, Oct, May, Mar, Jun, Feb, Aug, Aug,…
$ arrival_date_year <int> 2016, 2016, 2016, 2016, 2016, 2016, 2017, 20…
$ arrival_date_AllSouls <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_AshWednesday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_ChristmasEve <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_Easter <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_ChristmasDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_GoodFriday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_NewYearsDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ arrival_date_PalmSunday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
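A sketch of the recipe that produces the output above, with the holiday names taken from the indicator columns shown:

```r
# step_holiday() adds one 0/1 indicator column per requested holiday;
# step_rm() then drops the raw date, which the model no longer needs.
recipe(children ~ ., data = hotels) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date,
               holidays = c("AllSouls", "AshWednesday", "ChristmasEve",
                            "Easter", "ChristmasDay", "GoodFriday",
                            "NewYearsDay", "PalmSunday")) |>
  step_rm(arrival_date)
```
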
To predict the outcome of a new data point:
parsnip
KNN requires all numeric predictors, and all need to be centered and scaled.
What does that mean?
Why do you need to "train" a recipe?
Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?
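This is why recipes separate training from application. A hedged sketch (hotels_test is an assumed held-out split alongside the hotels_train set used later in these slides):

```r
# prep() *trains* the recipe: it estimates the means and sds
# from the training data and stores them inside the recipe.
trained_rec <- recipe(children ~ ., data = hotels_train) |>
  step_normalize(all_numeric_predictors()) |>
  prep(training = hotels_train)

# bake() applies the stored steps to new data, so a new data
# point is scaled with the *training* means and sds.
bake(trained_rec, new_data = hotels_test)
```
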
# A tibble: 5 × 1
meal
<fct>
1 SC
2 BB
3 HB
4 Undefined
5 FB
# A tibble: 50,000 × 5
SC BB HB Undefined FB
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 1 0 0 0
6 0 1 0 0 0
7 0 0 1 0 0
8 0 1 0 0 0
9 0 0 1 0 0
10 1 0 0 0 0
# ℹ 49,990 more rows
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.38 0.0183 130. 0
2 mealFB -1.15 0.165 -6.98 2.88e-12
3 mealHB -0.118 0.0465 -2.54 1.12e- 2
4 mealSC 1.43 0.104 13.7 1.37e-42
5 mealUndefined 0.570 0.188 3.03 2.47e- 3
step_dummy()
Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.
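A minimal sketch of the step that produces indicator columns like the meal table above (note step_dummy() by default emits one fewer column than there are factor levels, to avoid redundancy):

```r
# step_dummy() converts every nominal predictor into 0/1
# indicator columns, leaving numeric predictors untouched.
recipe(children ~ ., data = hotels) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = NULL)
```
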
How does recipes know which variables are numeric and which are nominal?
How does recipes know what is a predictor and what is an outcome?
The formula → indicates outcomes vs. predictors
The data → is only used to catalog the names and types of each variable
Helper functions for selecting sets of variables
| selector | description |
|---|---|
| all_predictors() | Each x variable (right side of ~) |
| all_outcomes() | Each y variable (left side of ~) |
| all_numeric() | Each numeric variable |
| all_nominal() | Each categorical variable (e.g. factor, string) |
| all_nominal_predictors() | Each categorical variable (e.g. factor, string) that is defined as a predictor |
| all_numeric_predictors() | Each numeric variable that is defined as a predictor |
What would happen if you try to normalize a variable that doesn't vary?
Error! You'd be dividing by zero!
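The division comes from the scaling formula (x − mean) / sd; for a constant column the standard deviation is zero:

```r
x <- c(5, 5, 5)   # a zero-variance "variable"
sd(x)             # 0 -- no variation at all
(x - mean(x)) / sd(x)  # 0/0 in R: NaN NaN NaN
```
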
step_zv()
Intelligently handles zero variance variables (variables that contain only a single value)
step_normalize()
Centers then scales numeric variable (mean = 0, sd = 1)
step_downsample()
Down-samples the data so that all levels of a factor occur with equal frequency (provided by the themis package)
Unscramble! You have all the steps from our knn_rec recipe; your challenge is to unscramble them into the right order!
Save the result as knn_rec.
03:00
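One consistent ordering, following the step sequence printed in the workflow shown later in these slides (date features first, dummies before the zero-variance filter, normalization after dummies, down-sampling last):

```r
# Order matters: each step sees the columns produced by the steps
# before it, so e.g. step_zv() must run after step_dummy() creates
# the indicator columns it might need to filter.
knn_rec <- recipe(children ~ ., data = hotels_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date) |>
  step_rm(arrival_date) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_downsample(children)
```
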
usemodels
https://tidymodels.github.io/usemodels/
kknn_recipe <-
recipe(formula = children ~ ., data = hotels) %>%
## Since distance calculations are used, the predictor variables should
## be on the same scale. Before centering and scaling the numeric
## predictors, any predictors with a single unique value are filtered
## out.
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
kknn_spec <-
nearest_neighbor() %>%
set_mode("classification") %>%
set_engine("kknn")
kknn_workflow <-
workflow() %>%
add_recipe(kknn_recipe) %>%
add_model(kknn_spec)
glmnet_recipe <-
recipe(formula = children ~ ., data = hotels) %>%
## Regularization methods sum up functions of the model slope
## coefficients. Because of this, the predictor variables should be on
## the same scale. Before centering and scaling the numeric predictors,
## any predictors with a single unique value are filtered out.
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
glmnet_spec <-
logistic_reg() %>%
set_mode("classification") %>%
set_engine("glmnet")
glmnet_workflow <-
workflow() %>%
add_recipe(glmnet_recipe) %>%
add_model(glmnet_spec)
Note
usemodels creates boilerplate code using the older pipe operator %>%
Now we've built a recipe.
But, how do we use a recipe?
Feature engineering and modeling are two halves of a single predictive workflow.
workflow()
Creates a workflow to which you can add a model (and more)
add_formula()
Adds a formula to a workflow *
add_model()
Adds a parsnip model spec to a workflow
If we use add_model() to add a model to a workflow, what would we use to add a recipe? Let's see!
Fill in the blanks to make a workflow that combines knn_rec and knn_mod.
01:00
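One way to fill in the blanks (assuming the knn_rec recipe and knn_mod model spec built earlier), producing the workflow printed below:

```r
# A workflow bundles preprocessing and model into one object,
# so fitting and prediction replay both consistently.
knn_wf <- workflow() |>
  add_recipe(knn_rec) |>
  add_model(knn_mod)
```
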
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
• step_downsample()
── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)
Computational engine: kknn
add_recipe()
Adds a recipe to a workflow.
Do you need to add a formula if you have a recipe?
fit()
Fit a workflow that bundles a recipe*
and a model.
fit_resamples()
Fit a workflow that bundles a recipe*
and a model with resampling.
Run the first chunk. Then try our KNN workflow on hotels_folds. What is the ROC AUC?
03:00
set.seed(100)
hotels_folds <- vfold_cv(hotels_train, v = 10, strata = children)
knn_wf |>
fit_resamples(resamples = hotels_folds) |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.741 10 0.00209 Preprocessor1_Mod…
2 roc_auc binary 0.830 10 0.00231 Preprocessor1_Mod…
update_recipe()
Replace the recipe in a workflow.
update_model()
Replace the model in a workflow.
Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let's try it out!
plr_mod <- logistic_reg(penalty = .01, mixture = 1) |>
set_engine("glmnet") |>
set_mode("classification")
plr_mod |>
translate()
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = 0.01
mixture = 1
Computational engine: glmnet
Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
alpha = 1, family = "binomial")
03:00
glmnet_wf <- knn_wf |>
update_model(plr_mod)
glmnet_wf |>
fit_resamples(resamples = hotels_folds) |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.828 10 0.00218 Preprocessor1_Mod…
2 roc_auc binary 0.873 10 0.00210 Preprocessor1_Mod…
workflow() to create explicit, logical pipelines for training a machine learning model