Palmer Penguins and regression with multiple predictors

Suggested answers

Application exercise
Answers
Modified

March 20, 2024

In this application exercise we will continue to study penguins. The data can be found in the palmerpenguins package and we will use tidyverse and tidymodels for data exploration and modeling, respectively.

library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(skimr)

Please read the following context and take a skim of the data set before we get started.

This data set comprising various measurements of three different penguin species, namely Adelie, Gentoo, and Chinstrap. The rigorous study was conducted in the islands of the Palmer Archipelago, Antarctica. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data set is called penguins.

skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Our goal is to understand better how various body measurements and attributes of penguins relate to their body mass.

Body mass vs. flipper length

The regression model for body mass vs. flipper length is as follows.

bm_fl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(bm_fl_fit)
# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

And this is the regression model for body mass vs. island.

bm_island_fit <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(bm_island_fit)
# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

Body mass vs. flipper length (transformed)

  • Your turn: Run the code chunk below and create two separate plots. How are the two plots different than each other? Which plot’s model appears to fit the data better?
# Plot A
ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Plot A - Linear model")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
# Plot B
ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2, raw = TRUE)) +
  labs(title = "Plot B - Polynomial model")
Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Plot B appears to better fit the contours of the scatterplot.

Fitting a model with transformed variables

Demo: Fit the polynomial model from above.

# manually create polynomials
bm_fl_poly_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2), data = penguins)
tidy(bm_fl_poly_fit)
# A tibble: 3 × 5
  term                    estimate std.error statistic    p.value
  <chr>                      <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)            16771.     4560.         3.68 0.000273  
2 flipper_length_mm       -173.       45.0       -3.85 0.000141  
3 I(flipper_length_mm^2)     0.548     0.111      4.96 0.00000114
# use poly()
bm_fl_poly_fit <- linear_reg() |>
  fit(body_mass_g ~ poly(flipper_length_mm, degree = 2, raw = TRUE), data = drop_na(penguins, flipper_length_mm))

tidy(bm_fl_poly_fit)
# A tibble: 3 × 5
  term                                      estimate std.error statistic p.value
  <chr>                                        <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)                                1.68e+4  4560.         3.68 2.73e-4
2 poly(flipper_length_mm, degree = 2, raw … -1.73e+2    45.0       -3.85 1.41e-4
3 poly(flipper_length_mm, degree = 2, raw …  5.48e-1     0.111      4.96 1.14e-6

\[ \widehat{body~mass} = 16770 - 173 \times flipper~length + 0.55 \times flipper~length^2 \]

  • Your turn: Interpret the slope coefficients for flipper length in the context of the data and the research question.

    For every 1 millimeter the flipper is longer, we expect body mass to be different, on average, by \(-173 + 1.10 \times flipper~length\) grams. The expected difference in flipper length varies based on the initial value of flipper length.

  • Demo: Predict the body mass of penguins with flipper lengths of 180, 190, 200, and 210 mm.

penguin_flippers <- tibble(
  flipper_length_mm = c(180, 190, 200, 210)
)

predict(bm_fl_poly_fit, new_data = penguin_flippers) |>
  # difference in predicted value compared to previous row
  mutate(diff = .pred - lag(.pred))
# A tibble: 4 × 2
  .pred  diff
  <dbl> <dbl>
1 3350.   NA 
2 3646.  296.
3 4051.  406.
4 4566.  515.

Body mass vs. flipper length and island

Next, we will expand our understanding of models by continuing to learn about penguins. So far, we modeled body mass by flipper length, and in a separate model, modeled body mass by island. Could it be possible that the estimated body mass of a penguin changes by both their flipper length AND by the island they are on?

  • Demo: Fit a model to predict body mass from flipper length and island. Display the summary output and write out the estimate regression equation below.
bm_fl_island_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + island, data = penguins)

tidy(bm_fl_island_fit)
# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -4625.     392.      -11.8  4.29e-27
2 flipper_length_mm     44.5      1.87     23.9  1.65e-74
3 islandDream         -262.      55.0      -4.77 2.75e- 6
4 islandTorgersen     -185.      70.3      -2.63 8.84e- 3

\[ \widehat{body~mass} = -4625 + 44.5 \times flipper~length - 262 \times Dream - 185 \times Torgersen \]

  • Your turn: Interpret the slope coefficient for flipper length in the context of the data and the research question.

    For every 1 millimeter the flipper is longer, we expect body mass to be higher, on average, by 44.5 grams, holding all else (the island) constant. In other words, this is true for penguins in a given island, regardless of the island.

  • Demo: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.

penguin_200_Dream <- tibble(
  flipper_length_mm = 200,
  island = "Dream"
)

predict(bm_fl_island_fit, new_data = penguin_200_Dream)
# A tibble: 1 × 1
  .pred
  <dbl>
1 4021.

Additive vs. interaction models

  • Review: What assumption does the additive model make about the slopes between flipper length and body mass for each of the three islands?

    The additive model assumes the same slope between body mass and flipper length for all three islands.

  • Demo: Now fit an interaction model of body mass vs. flipper length and island.

bm_fl_island_int_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm * island, data = penguins)

tidy(bm_fl_island_int_fit)
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                        -5464.     431.      -12.7  2.51e-30
2 flipper_length_mm                     48.5      2.05     23.7  1.66e-73
3 islandDream                         3551.     969.        3.66 2.89e- 4
4 islandTorgersen                     3218.    1680.        1.92 5.62e- 2
5 flipper_length_mm:islandDream        -19.4      4.94     -3.93 1.03e- 4
6 flipper_length_mm:islandTorgersen    -17.4      8.73     -1.99 4.69e- 2

\[ \begin{aligned} \widehat{body~mass} &= -5464 + 48.5 \times flipper~length \\ &\quad+ 3551 \times Dream + 3218 \times Torgersen \\ &\quad- 19.4 \times flipper~length \times Dream \\ &\quad- 17.4 \times flipper~length \times Torgersen \end{aligned} \]

  • Review: What does modeling body mass with an interaction effect get us that without doing so does not?

    The interaction effect allows us to model the rate of change in estimated body mass as flipper length increases as different in the three islands.

  • Your turn: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.

predict(bm_fl_island_int_fit, new_data = penguin_200_Dream)
# A tibble: 1 × 1
  .pred
  <dbl>
1 3915.

Choosing a model

Rule of thumb: Occam’s Razor - Don’t overcomplicate the situation! We prefer the simplest best model.

glance(bm_fl_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.759         0.758  394.     1071. 4.37e-107     1 -2528. 5063. 5074.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_poly_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.775         0.774  381.      585. 1.27e-110     2 -2516. 5041. 5056.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_island_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.774         0.772  383.      386. 7.60e-109     3 -2517. 5045. 5064.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(bm_fl_island_int_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.786         0.783  374.      246. 4.55e-110     5 -2508. 5031. 5057.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Your turn: Which model do you believe is most appropriate? Why?

Could go with multiple choices. Personally I would choose the flipper length + island model, as it incorporates two variables that seem correlated with body mass, has a sufficiently high adjusted-\(R^2\), and is not overly complex (e.g. easier to interpret).

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.6.6
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-06-13
 pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 backports        1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
 base64enc        0.1-3      2015-07-28 [1] CRAN (R 4.3.0)
 broom          * 1.0.5      2023-06-09 [1] CRAN (R 4.3.0)
 class            7.3-22     2023-05-03 [1] CRAN (R 4.3.2)
 cli              3.6.2      2023-12-11 [1] CRAN (R 4.3.1)
 codetools        0.2-19     2023-02-01 [1] CRAN (R 4.3.2)
 colorspace       2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 data.table       1.15.4     2024-03-30 [1] CRAN (R 4.3.1)
 dials          * 1.2.1      2024-02-22 [1] CRAN (R 4.3.1)
 DiceDesign       1.10       2023-12-07 [1] CRAN (R 4.3.1)
 digest           0.6.35     2024-03-11 [1] CRAN (R 4.3.1)
 dplyr          * 1.1.4      2023-11-17 [1] CRAN (R 4.3.1)
 evaluate         0.23       2023-11-01 [1] CRAN (R 4.3.1)
 fansi            1.0.6      2023-12-08 [1] CRAN (R 4.3.1)
 farver           2.1.2      2024-05-13 [1] CRAN (R 4.3.3)
 fastmap          1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 forcats        * 1.0.0      2023-01-29 [1] CRAN (R 4.3.0)
 foreach          1.5.2      2022-02-02 [1] CRAN (R 4.3.0)
 furrr            0.3.1      2022-08-15 [1] CRAN (R 4.3.0)
 future           1.33.2     2024-03-26 [1] CRAN (R 4.3.1)
 future.apply     1.11.2     2024-03-28 [1] CRAN (R 4.3.1)
 generics         0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2        * 3.5.1      2024-04-23 [1] CRAN (R 4.3.1)
 globals          0.16.3     2024-03-08 [1] CRAN (R 4.3.1)
 glue             1.7.0      2024-01-09 [1] CRAN (R 4.3.1)
 gower            1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
 GPfit            1.0-8      2019-02-08 [1] CRAN (R 4.3.0)
 gtable           0.3.5      2024-04-22 [1] CRAN (R 4.3.1)
 hardhat          1.3.1      2024-02-02 [1] CRAN (R 4.3.1)
 here             1.0.1      2020-12-13 [1] CRAN (R 4.3.0)
 hms              1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
 htmltools        0.5.8.1    2024-04-04 [1] CRAN (R 4.3.1)
 htmlwidgets      1.6.4      2023-12-06 [1] CRAN (R 4.3.1)
 infer          * 1.0.7      2024-03-25 [1] CRAN (R 4.3.1)
 ipred            0.9-14     2023-03-09 [1] CRAN (R 4.3.0)
 iterators        1.0.14     2022-02-05 [1] CRAN (R 4.3.0)
 jsonlite         1.8.8      2023-12-04 [1] CRAN (R 4.3.1)
 knitr            1.45       2023-10-30 [1] CRAN (R 4.3.1)
 labeling         0.4.3      2023-08-29 [1] CRAN (R 4.3.0)
 lattice          0.21-9     2023-10-01 [1] CRAN (R 4.3.2)
 lava             1.8.0      2024-03-05 [1] CRAN (R 4.3.1)
 lhs              1.1.6      2022-12-17 [1] CRAN (R 4.3.0)
 lifecycle        1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
 listenv          0.9.1      2024-01-29 [1] CRAN (R 4.3.1)
 lubridate      * 1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
 magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 MASS             7.3-60     2023-05-04 [1] CRAN (R 4.3.2)
 Matrix           1.6-1.1    2023-09-18 [1] CRAN (R 4.3.2)
 mgcv             1.9-0      2023-07-11 [1] CRAN (R 4.3.2)
 modeldata      * 1.3.0      2024-01-21 [1] CRAN (R 4.3.1)
 munsell          0.5.1      2024-04-01 [1] CRAN (R 4.3.1)
 nlme             3.1-163    2023-08-09 [1] CRAN (R 4.3.2)
 nnet             7.3-19     2023-05-03 [1] CRAN (R 4.3.2)
 palmerpenguins * 0.1.1      2022-08-15 [1] CRAN (R 4.3.0)
 parallelly       1.37.1     2024-02-29 [1] CRAN (R 4.3.1)
 parsnip        * 1.2.1      2024-03-22 [1] CRAN (R 4.3.1)
 pillar           1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 prodlim          2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
 purrr          * 1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
 R6               2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp             1.0.12     2024-01-09 [1] CRAN (R 4.3.1)
 readr          * 2.1.5      2024-01-10 [1] CRAN (R 4.3.1)
 recipes        * 1.0.10     2024-02-18 [1] CRAN (R 4.3.1)
 repr             1.1.6      2023-01-26 [1] CRAN (R 4.3.0)
 rlang            1.1.3      2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown        2.26       2024-03-05 [1] CRAN (R 4.3.1)
 rpart            4.1.21     2023-10-09 [1] CRAN (R 4.3.2)
 rprojroot        2.0.4      2023-11-05 [1] CRAN (R 4.3.1)
 rsample        * 1.2.1      2024-03-25 [1] CRAN (R 4.3.1)
 rstudioapi       0.16.0     2024-03-24 [1] CRAN (R 4.3.1)
 scales         * 1.3.0.9000 2024-05-07 [1] Github (r-lib/scales@c0f79d3)
 sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 skimr          * 2.1.5      2022-12-23 [1] CRAN (R 4.3.0)
 stringi          1.8.3      2023-12-11 [1] CRAN (R 4.3.1)
 stringr        * 1.5.1      2023-11-14 [1] CRAN (R 4.3.1)
 survival         3.5-7      2023-08-14 [1] CRAN (R 4.3.2)
 tibble         * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidymodels     * 1.2.0      2024-03-25 [1] CRAN (R 4.3.1)
 tidyr          * 1.3.1      2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect       1.2.1      2024-03-11 [1] CRAN (R 4.3.1)
 tidyverse      * 2.0.0      2023-02-22 [1] CRAN (R 4.3.0)
 timechange       0.3.0      2024-01-18 [1] CRAN (R 4.3.1)
 timeDate         4032.109   2023-12-14 [1] CRAN (R 4.3.1)
 tune           * 1.2.1      2024-04-18 [1] CRAN (R 4.3.1)
 tzdb             0.4.0      2023-05-12 [1] CRAN (R 4.3.0)
 utf8             1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
 vctrs            0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
 withr            3.0.0      2024-01-16 [1] CRAN (R 4.3.1)
 workflows      * 1.1.4      2024-02-19 [1] CRAN (R 4.3.1)
 workflowsets   * 1.1.0      2024-03-21 [1] CRAN (R 4.3.1)
 xfun             0.43       2024-03-25 [1] CRAN (R 4.3.1)
 yaml             2.3.8      2023-12-11 [1] CRAN (R 4.3.1)
 yardstick      * 1.3.1      2024-03-21 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────