library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(skimr)
Palmer Penguins and regression with multiple predictors
Suggested answers
In this application exercise we will continue to study penguins. The data can be found in the palmerpenguins package and we will use tidyverse and tidymodels for data exploration and modeling, respectively.
Please read the following context and take a skim
of the data set before we get started.
This data set comprising various measurements of three different penguin species, namely Adelie, Gentoo, and Chinstrap. The rigorous study was conducted in the islands of the Palmer Archipelago, Antarctica. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data set is called
penguins
.
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Our goal is to understand better how various body measurements and attributes of penguins relate to their body mass.
Body mass vs. flipper length
The regression model for body mass vs. flipper length is as follows.
<- linear_reg() |>
bm_fl_fit fit(body_mass_g ~ flipper_length_mm, data = penguins)
tidy(bm_fl_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5781. 306. -18.9 5.59e- 55
2 flipper_length_mm 49.7 1.52 32.7 4.37e-107
And this is the regression model for body mass vs. island.
<- linear_reg() |>
bm_island_fit fit(body_mass_g ~ island, data = penguins)
tidy(bm_island_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4716. 48.5 97.3 8.93e-250
2 islandDream -1003. 74.2 -13.5 1.42e- 33
3 islandTorgersen -1010. 100. -10.1 4.66e- 21
Body mass vs. flipper length (transformed)
- Your turn: Run the code chunk below and create two separate plots. How are the two plots different than each other? Which plot’s model appears to fit the data better?
# Plot A
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Plot A - Linear model")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
# Plot B
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2, raw = TRUE)) +
labs(title = "Plot B - Polynomial model")
Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Removed 2 rows containing missing values (`geom_point()`).
Plot B appears to better fit the contours of the scatterplot.
Fitting a model with transformed variables
Demo: Fit the polynomial model from above.
# manually create polynomials
<- linear_reg() |>
bm_fl_poly_fit fit(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2), data = penguins)
tidy(bm_fl_poly_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 16771. 4560. 3.68 0.000273
2 flipper_length_mm -173. 45.0 -3.85 0.000141
3 I(flipper_length_mm^2) 0.548 0.111 4.96 0.00000114
# use poly()
<- linear_reg() |>
bm_fl_poly_fit fit(body_mass_g ~ poly(flipper_length_mm, degree = 2, raw = TRUE), data = drop_na(penguins, flipper_length_mm))
tidy(bm_fl_poly_fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.68e+4 4560. 3.68 2.73e-4
2 poly(flipper_length_mm, degree = 2, raw … -1.73e+2 45.0 -3.85 1.41e-4
3 poly(flipper_length_mm, degree = 2, raw … 5.48e-1 0.111 4.96 1.14e-6
\[ \widehat{body~mass} = 16770 - 173 \times flipper~length + 0.55 \times flipper~length^2 \]
Your turn: Interpret the slope coefficients for flipper length in the context of the data and the research question.
For every 1 millimeter the flipper is longer, we expect body mass to be different, on average, by \(-173 + 1.10 \times flipper~length\) grams. The expected difference in flipper length varies based on the initial value of flipper length.
Demo: Predict the body mass of penguins with flipper lengths of 180, 190, 200, and 210 mm.
<- tibble(
penguin_flippers flipper_length_mm = c(180, 190, 200, 210)
)
predict(bm_fl_poly_fit, new_data = penguin_flippers) |>
# difference in predicted value compared to previous row
mutate(diff = .pred - lag(.pred))
# A tibble: 4 × 2
.pred diff
<dbl> <dbl>
1 3350. NA
2 3646. 296.
3 4051. 406.
4 4566. 515.
Body mass vs. flipper length and island
Next, we will expand our understanding of models by continuing to learn about penguins. So far, we modeled body mass by flipper length, and in a separate model, modeled body mass by island. Could it be possible that the estimated body mass of a penguin changes by both their flipper length AND by the island they are on?
- Demo: Fit a model to predict body mass from flipper length and island. Display the summary output and write out the estimate regression equation below.
<- linear_reg() |>
bm_fl_island_fit fit(body_mass_g ~ flipper_length_mm + island, data = penguins)
tidy(bm_fl_island_fit)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -4625. 392. -11.8 4.29e-27
2 flipper_length_mm 44.5 1.87 23.9 1.65e-74
3 islandDream -262. 55.0 -4.77 2.75e- 6
4 islandTorgersen -185. 70.3 -2.63 8.84e- 3
\[ \widehat{body~mass} = -4625 + 44.5 \times flipper~length - 262 \times Dream - 185 \times Torgersen \]
Your turn: Interpret the slope coefficient for flipper length in the context of the data and the research question.
For every 1 millimeter the flipper is longer, we expect body mass to be higher, on average, by 44.5 grams, holding all else (the island) constant. In other words, this is true for penguins in a given island, regardless of the island.
Demo: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.
<- tibble(
penguin_200_Dream flipper_length_mm = 200,
island = "Dream"
)
predict(bm_fl_island_fit, new_data = penguin_200_Dream)
# A tibble: 1 × 1
.pred
<dbl>
1 4021.
Additive vs. interaction models
Review: What assumption does the additive model make about the slopes between flipper length and body mass for each of the three islands?
The additive model assumes the same slope between body mass and flipper length for all three islands.
Demo: Now fit an interaction model of body mass vs. flipper length and island.
<- linear_reg() |>
bm_fl_island_int_fit fit(body_mass_g ~ flipper_length_mm * island, data = penguins)
tidy(bm_fl_island_int_fit)
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5464. 431. -12.7 2.51e-30
2 flipper_length_mm 48.5 2.05 23.7 1.66e-73
3 islandDream 3551. 969. 3.66 2.89e- 4
4 islandTorgersen 3218. 1680. 1.92 5.62e- 2
5 flipper_length_mm:islandDream -19.4 4.94 -3.93 1.03e- 4
6 flipper_length_mm:islandTorgersen -17.4 8.73 -1.99 4.69e- 2
\[ \begin{aligned} \widehat{body~mass} &= -5464 + 48.5 \times flipper~length \\ &\quad+ 3551 \times Dream + 3218 \times Torgersen \\ &\quad- 19.4 \times flipper~length \times Dream \\ &\quad- 17.4 \times flipper~length \times Torgersen \end{aligned} \]
Review: What does modeling body mass with an interaction effect get us that without doing so does not?
The interaction effect allows us to model the rate of change in estimated body mass as flipper length increases as different in the three islands.
Your turn: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.
predict(bm_fl_island_int_fit, new_data = penguin_200_Dream)
# A tibble: 1 × 1
.pred
<dbl>
1 3915.
Choosing a model
Rule of thumb: Occam’s Razor - Don’t overcomplicate the situation! We prefer the simplest best model.
glance(bm_fl_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.759 0.758 394. 1071. 4.37e-107 1 -2528. 5063. 5074.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_poly_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.775 0.774 381. 585. 1.27e-110 2 -2516. 5041. 5056.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_island_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.774 0.772 383. 386. 7.60e-109 3 -2517. 5045. 5064.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_island_int_fit)
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.786 0.783 374. 246. 4.55e-110 5 -2508. 5031. 5057.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Your turn: Which model do you believe is most appropriate? Why?
Could go with multiple choices. Personally I would choose the flipper length + island model, as it incorporates two variables that seem correlated with body mass, has a sufficiently high adjusted-\(R^2\), and is not overly complex (e.g. easier to interpret).
::session_info() sessioninfo
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Ventura 13.5.2
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-03-23
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0)
broom * 1.0.5 2023-06-09 [1] CRAN (R 4.3.0)
class 7.3-22 2023-05-03 [1] CRAN (R 4.3.2)
cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
codetools 0.2-19 2023-02-01 [1] CRAN (R 4.3.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
data.table 1.14.10 2023-12-08 [1] CRAN (R 4.3.1)
dials * 1.2.0 2023-04-03 [1] CRAN (R 4.3.0)
DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.3.1)
digest 0.6.34 2024-01-11 [1] CRAN (R 4.3.1)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.3.1)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
furrr 0.3.1 2022-08-15 [1] CRAN (R 4.3.0)
future 1.33.1 2023-12-22 [1] CRAN (R 4.3.1)
future.apply 1.11.1 2023-12-21 [1] CRAN (R 4.3.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.3.1)
globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.1)
gower 1.0.1 2022-12-22 [1] CRAN (R 4.3.0)
GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.3.0)
gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)
hardhat 1.3.0 2023-03-30 [1] CRAN (R 4.3.0)
here 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.1)
infer * 1.0.5 2023-09-06 [1] CRAN (R 4.3.0)
ipred 0.9-14 2023-03-09 [1] CRAN (R 4.3.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.3.1)
knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)
lattice 0.21-9 2023-10-01 [1] CRAN (R 4.3.2)
lava 1.7.3 2023-11-04 [1] CRAN (R 4.3.1)
lhs 1.1.6 2022-12-17 [1] CRAN (R 4.3.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.3.2)
Matrix 1.6-1.1 2023-09-18 [1] CRAN (R 4.3.2)
mgcv 1.9-0 2023-07-11 [1] CRAN (R 4.3.2)
modeldata * 1.2.0 2023-08-09 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
nlme 3.1-163 2023-08-09 [1] CRAN (R 4.3.2)
nnet 7.3-19 2023-05-03 [1] CRAN (R 4.3.2)
palmerpenguins * 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)
parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
parsnip * 1.2.0 2024-02-16 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
prodlim 2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.1)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.3.1)
recipes * 1.0.9 2023-12-13 [1] CRAN (R 4.3.1)
repr 1.1.6 2023-01-26 [1] CRAN (R 4.3.0)
rlang 1.1.3 2024-01-10 [1] CRAN (R 4.3.1)
rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
rpart 4.1.21 2023-10-09 [1] CRAN (R 4.3.2)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.3.1)
rsample * 1.2.0 2023-08-23 [1] CRAN (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
scales * 1.2.1 2024-01-18 [1] Github (r-lib/scales@c8eb772)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
skimr * 2.1.5 2022-12-23 [1] CRAN (R 4.3.0)
stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.1)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.3.1)
survival 3.5-7 2023-08-14 [1] CRAN (R 4.3.2)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidymodels * 1.1.1 2023-08-24 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
timeDate 4032.109 2023-12-14 [1] CRAN (R 4.3.1)
tune * 1.1.2 2023-08-23 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1)
workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.3.1)
workflowsets * 1.0.1 2023-04-06 [1] CRAN (R 4.3.0)
xfun 0.41 2023-11-01 [1] CRAN (R 4.3.1)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1)
yardstick * 1.3.0 2024-01-19 [1] CRAN (R 4.3.1)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────