🔨 Build models with parsnip
To specify a model with parsnip
- Pick a model
- Set the engine
- Set the mode (if needed)
Provides a unified interface to many models, regardless of the underlying package.
To specify a model with parsnip
logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
To specify a model with parsnip
decision_tree() |>
  set_engine("C5.0") |>
  set_mode("classification")
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
To specify a model with parsnip
nearest_neighbor() |>
  set_engine("kknn") |>
  set_mode("classification")
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
logistic_reg()
Specifies a model that uses logistic regression
logistic_reg(penalty = NULL, mixture = NULL)
logistic_reg()
Specifies a model that uses logistic regression
logistic_reg(
  mode = "classification", # default mode, if one exists
  penalty = NULL,          # model hyperparameter
  mixture = NULL           # model hyperparameter
)
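A side note (not from the original slides): penalty and mixture only take effect with a regularized engine. A minimal sketch, assuming the glmnet package is available:
# Sketch: a lasso-style logistic regression, where penalty and mixture are actually used
logistic_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet")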
set_engine()
Adds an engine to power or implement the model.
logistic_reg() |> set_engine(engine = "glm")
Set the engine when you define the model type.
logistic_reg(engine = "glm")
set_mode()
Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.
logistic_reg() |> set_mode(mode = "classification")
⏱️ Your turn 1
Run the chunk in your .qmd and look at the output. Then copy/paste the code and edit it to create a decision tree model that uses the C5.0 engine.
Save it as tree_mod and look at the object. What is different about the output?
Hint: you’ll need https://www.tidymodels.org/find/parsnip/
lr_mod
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
tree_mod <- decision_tree() |>
  set_engine(engine = "C5.0") |>
  set_mode("classification")
tree_mod
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
Now we’ve built a model.
But how do we use a model?
First - what does it mean to use a model?
Statistical models learn from the data.
Many learn model parameters, which can be useful as values for inference and interpretation.
fit()
Train a model by fitting it to data. Returns a parsnip model fit.
fit(tree_mod, children ~ average_daily_rate + stays_in_weekend_nights, data = hotels)
fit()
Train a model by fitting it to data. Returns a parsnip model fit.
tree_mod |>
  fit(
    children ~ average_daily_rate + stays_in_weekend_nights,
    data = hotels
  )
1. parsnip model
2. Formula for the model
3. Data set
A fitted model
lr_mod |>
  fit(children ~ average_daily_rate + stays_in_weekend_nights,
      data = hotels
  ) |>
  broom::tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.06 0.0956 21.5 5.31e-103
## 2 average_daily_rate -0.0168 0.000710 -23.7 9.78e-124
## 3 stays_in_weekend_n… -0.0671 0.0352 -1.91 5.63e- 2
“All models are wrong, but some are useful”
## Truth
## Prediction children none
## children 1277 483
## none 723 1517
“All models are wrong, but some are useful”
## Truth
## Prediction children none
## children 1446 646
## none 554 1354
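As a sketch (not how the slides computed these tables), a confusion matrix like the ones above can be produced with yardstick's conf_mat() by comparing predicted classes to the truth; fitting and predicting on the full hotels data here is just for illustration, assuming tidymodels is loaded:
# Sketch: fit, predict classes, then cross-tabulate predictions vs. truth
lr_mod |>
  fit(children ~ average_daily_rate + stays_in_weekend_nights, data = hotels) |>
  predict(new_data = hotels) |>   # adds a .pred_class column
  bind_cols(hotels) |>
  conf_mat(truth = children, estimate = .pred_class)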
Axiom
The best way to measure a model’s performance at predicting new data is to predict new data.
♻️ Resample models with rsample
The holdout method
initial_split()*
“Splits” data randomly into a single testing and a single training set.
initial_split(data, prop = 3 / 4)
initial_split()
hotels_split <- initial_split(data = hotels, prop = 3 / 4)
hotels_split
## <Training/Testing/Total>
## <3000/1000/4000>
training()
and testing()*
Extract training and testing sets from an rsplit
training(hotels_split)
testing(hotels_split)
training()
hotels_train <- training(hotels_split)
hotels_train
## # A tibble: 3,000 × 22
## hotel lead_time stays_in_weekend_nights
## <fct> <dbl> <dbl>
## 1 City_Hotel 65 1
## 2 City_Hotel 0 2
## 3 Resort_Hotel 119 2
## 4 City_Hotel 302 2
## 5 City_Hotel 157 1
## 6 Resort_Hotel 42 1
## 7 City_Hotel 36 0
## 8 City_Hotel 87 1
## 9 City_Hotel 6 2
## 10 City_Hotel 108 0
## # ℹ 2,990 more rows
## # ℹ 19 more variables: stays_in_week_nights <dbl>,
## # adults <dbl>, children <fct>, meal <fct>,
## # market_segment <fct>, …
⏱️ Your turn 2
Fill in the blanks.
Use initial_split(), training(), and testing() to:
- Split hotels into training and testing sets. Save the rsplit!
- Extract the training data and testing data.
- Check the proportions of the children variable in each set.
Keep set.seed(100) at the start of your code.
set.seed(100) # Important!
hotels_split <- initial_split(hotels, prop = 3 / 4)
hotels_train <- training(hotels_split)
hotels_test <- testing(hotels_split)
count(x = hotels_train, children) |>
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## children n prop
## <fct> <int> <dbl>
## 1 children 1503 0.501
## 2 none 1497 0.499
count(x = hotels_test, children) |>
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## children n prop
## <fct> <int> <dbl>
## 1 children 497 0.497
## 2 none 503 0.503
Note: Stratified sampling
Use stratified random sampling to keep the class proportions of children (nearly) identical in the training and testing sets
set.seed(100) # Important!
hotels_split <- initial_split(hotels, prop = 3 / 4, strata = children)
hotels_train <- training(hotels_split)
hotels_test <- testing(hotels_split)
count(x = hotels_train, children) |>
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## children n prop
## <fct> <int> <dbl>
## 1 children 1500 0.5
## 2 none 1500 0.5
count(x = hotels_test, children) |>
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## children n prop
## <fct> <int> <dbl>
## 1 children 500 0.5
## 2 none 500 0.5
The testing set is precious…
we can only use it once!
How can we use the training set to compare, evaluate, and tune models?
V-fold cross-validation
vfold_cv(data, v = 10, ...)
Guess
How many times does an observation/row appear in the assessment set?
Quiz
If we use 10 folds, what percentage of the data ends up in the analysis set and what percentage in the assessment set for each fold?
90% - analysis
10% - assessment
⏱️ Your Turn 3
Run the code below. What does it return?
set.seed(100)
hotels_folds <- vfold_cv(data = hotels_train, v = 10)
hotels_folds
set.seed(100)
hotels_folds <- vfold_cv(data = hotels_train, v = 10)
hotels_folds
## # 10-fold cross-validation
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [2700/300]> Fold01
## 2 <split [2700/300]> Fold02
## 3 <split [2700/300]> Fold03
## 4 <split [2700/300]> Fold04
## 5 <split [2700/300]> Fold05
## 6 <split [2700/300]> Fold06
## 7 <split [2700/300]> Fold07
## 8 <split [2700/300]> Fold08
## 9 <split [2700/300]> Fold09
## 10 <split [2700/300]> Fold10
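A quick check of the earlier guess (a sketch, not part of the exercise): in V-fold cross-validation each row of hotels_train lands in exactly one assessment set, so the assessment rows across the 10 folds add up to the full training set.
# Sketch: 10 folds x 300 assessment rows = 3000 = nrow(hotels_train)
purrr::map_int(hotels_folds$splits, ~ nrow(assessment(.x))) |> sum()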
fit_resamples()
Trains and tests the model on each resample.
tree_mod |>
  fit_resamples(
    children ~ average_daily_rate + stays_in_weekend_nights,
    resamples = hotels_folds
  )
tree_mod |>
  fit_resamples(
    children ~ average_daily_rate + stays_in_weekend_nights,
    resamples = hotels_folds
  )
## # Resampling results
## # 10-fold cross-validation
## # A tibble: 10 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [2700/300]> Fold01 <tibble [2 × 4]> <tibble>
## 2 <split [2700/300]> Fold02 <tibble [2 × 4]> <tibble>
## 3 <split [2700/300]> Fold03 <tibble [2 × 4]> <tibble>
## 4 <split [2700/300]> Fold04 <tibble [2 × 4]> <tibble>
## 5 <split [2700/300]> Fold05 <tibble [2 × 4]> <tibble>
## 6 <split [2700/300]> Fold06 <tibble [2 × 4]> <tibble>
## 7 <split [2700/300]> Fold07 <tibble [2 × 4]> <tibble>
## 8 <split [2700/300]> Fold08 <tibble [2 × 4]> <tibble>
## 9 <split [2700/300]> Fold09 <tibble [2 × 4]> <tibble>
## 10 <split [2700/300]> Fold10 <tibble [2 × 4]> <tibble>
collect_metrics()
Unnests the metrics column from a tidymodels fit_resamples() result
_results |> collect_metrics(summarize = TRUE)
tree_fit <- tree_mod |>
  fit_resamples(
    children ~ average_daily_rate + stays_in_weekend_nights,
    resamples = hotels_folds
  )
collect_metrics(tree_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.696 10 0.0105 Preprocessor1_Mod…
## 2 roc_auc binary 0.697 10 0.0104 Preprocessor1_Mod…
collect_metrics(tree_fit, summarize = FALSE)
## # A tibble: 20 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold01 accuracy binary 0.677 Preprocessor1_Model1
## 2 Fold01 roc_auc binary 0.670 Preprocessor1_Model1
## 3 Fold02 accuracy binary 0.73 Preprocessor1_Model1
## 4 Fold02 roc_auc binary 0.731 Preprocessor1_Model1
## 5 Fold03 accuracy binary 0.683 Preprocessor1_Model1
## 6 Fold03 roc_auc binary 0.698 Preprocessor1_Model1
## 7 Fold04 accuracy binary 0.71 Preprocessor1_Model1
## 8 Fold04 roc_auc binary 0.718 Preprocessor1_Model1
## 9 Fold05 accuracy binary 0.737 Preprocessor1_Model1
## 10 Fold05 roc_auc binary 0.735 Preprocessor1_Model1
## # ℹ 10 more rows
10-fold CV
10 different analysis/assessment sets
10 different models (trained on analysis sets)
10 different sets of performance statistics (on assessment sets)
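To see what one fold contains (a sketch using the hotels_folds object from above), rsample's analysis() and assessment() pull the two data sets out of a single split:
# Sketch: inspect the first fold
first_fold <- hotels_folds$splits[[1]]
nrow(analysis(first_fold))   # 2700 rows used to train the model for Fold01
nrow(assessment(first_fold)) # 300 rows used to assess it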