Palmer Penguins and regression with multiple predictors

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

Body mass vs. flipper length

The regression model for body mass vs. flipper length is as follows.

bm_fl_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

tidy(bm_fl_fit)

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

And this is the regression model for body mass vs. island.

bm_island_fit <- linear_reg() |>
  fit(body_mass_g ~ island, data = penguins)

tidy(bm_island_fit)

# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21

Body mass vs. flipper length (transformed)

Your turn: Run the code chunk below and create two separate plots. How are the two plots different than each other? Which plot’s model appears to fit the data better?

# Plot A
ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Plot A - Linear model")

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

Warning: Removed 2 rows containing missing values (`geom_point()`).

# Plot B
ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2, raw = TRUE)) +
  labs(title = "Plot B - Polynomial model")

Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
Removed 2 rows containing missing values (`geom_point()`).

Plot B appears to better fit the contours of the scatterplot.

Fitting a model with transformed variables

Demo: Fit the polynomial model from above.

# manually create polynomials
bm_fl_poly_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2), data = penguins)
tidy(bm_fl_poly_fit)

# A tibble: 3 × 5
  term                    estimate std.error statistic    p.value
  <chr>                      <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)            16771.     4560.         3.68 0.000273  
2 flipper_length_mm       -173.       45.0       -3.85 0.000141  
3 I(flipper_length_mm^2)     0.548     0.111      4.96 0.00000114

# use poly()
bm_fl_poly_fit <- linear_reg() |>
  fit(body_mass_g ~ poly(flipper_length_mm, degree = 2, raw = TRUE), data = drop_na(penguins, flipper_length_mm))

tidy(bm_fl_poly_fit)

# A tibble: 3 × 5
  term                                      estimate std.error statistic p.value
  <chr>                                        <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)                                1.68e+4  4560.         3.68 2.73e-4
2 poly(flipper_length_mm, degree = 2, raw … -1.73e+2    45.0       -3.85 1.41e-4
3 poly(flipper_length_mm, degree = 2, raw …  5.48e-1     0.111      4.96 1.14e-6

\[ \widehat{body~mass} = 16770 - 173 \times flipper~length + 0.55 \times flipper~length^2 \]

Your turn: Interpret the slope coefficients for flipper length in the context of the data and the research question.

For every 1 millimeter the flipper is longer, we expect body mass to be different, on average, by \(-173 + 1.10 \times flipper~length\) grams. The expected difference in flipper length varies based on the initial value of flipper length.
Demo: Predict the body mass of penguins with flipper lengths of 180, 190, 200, and 210 mm.

penguin_flippers <- tibble(
  flipper_length_mm = c(180, 190, 200, 210)
)

predict(bm_fl_poly_fit, new_data = penguin_flippers) |>
  # difference in predicted value compared to previous row
  mutate(diff = .pred - lag(.pred))

# A tibble: 4 × 2
  .pred  diff
  <dbl> <dbl>
1 3350.   NA 
2 3646.  296.
3 4051.  406.
4 4566.  515.

Body mass vs. flipper length and island

Next, we will expand our understanding of models by continuing to learn about penguins. So far, we modeled body mass by flipper length, and in a separate model, modeled body mass by island. Could it be possible that the estimated body mass of a penguin changes by both their flipper length AND by the island they are on?

Demo: Fit a model to predict body mass from flipper length and island. Display the summary output and write out the estimate regression equation below.

bm_fl_island_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + island, data = penguins)

tidy(bm_fl_island_fit)

# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -4625.     392.      -11.8  4.29e-27
2 flipper_length_mm     44.5      1.87     23.9  1.65e-74
3 islandDream         -262.      55.0      -4.77 2.75e- 6
4 islandTorgersen     -185.      70.3      -2.63 8.84e- 3

\[ \widehat{body~mass} = -4625 + 44.5 \times flipper~length - 262 \times Dream - 185 \times Torgersen \]

Your turn: Interpret the slope coefficient for flipper length in the context of the data and the research question.

For every 1 millimeter the flipper is longer, we expect body mass to be higher, on average, by 44.5 grams, holding all else (the island) constant. In other words, this is true for penguins in a given island, regardless of the island.
Demo: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.

penguin_200_Dream <- tibble(
  flipper_length_mm = 200,
  island = "Dream"
)

predict(bm_fl_island_fit, new_data = penguin_200_Dream)

# A tibble: 1 × 1
  .pred
  <dbl>
1 4021.

Additive vs. interaction models

Review: What assumption does the additive model make about the slopes between flipper length and body mass for each of the three islands?

The additive model assumes the same slope between body mass and flipper length for all three islands.
Demo: Now fit an interaction model of body mass vs. flipper length and island.

bm_fl_island_int_fit <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm * island, data = penguins)

tidy(bm_fl_island_int_fit)

# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                        -5464.     431.      -12.7  2.51e-30
2 flipper_length_mm                     48.5      2.05     23.7  1.66e-73
3 islandDream                         3551.     969.        3.66 2.89e- 4
4 islandTorgersen                     3218.    1680.        1.92 5.62e- 2
5 flipper_length_mm:islandDream        -19.4      4.94     -3.93 1.03e- 4
6 flipper_length_mm:islandTorgersen    -17.4      8.73     -1.99 4.69e- 2

\[ \begin{aligned} \widehat{body~mass} &= -5464 + 48.5 \times flipper~length \\ &\quad+ 3551 \times Dream + 3218 \times Torgersen \\ &\quad- 19.4 \times flipper~length \times Dream \\ &\quad- 17.4 \times flipper~length \times Torgersen \end{aligned} \]

Review: What does modeling body mass with an interaction effect get us that without doing so does not?

The interaction effect allows us to model the rate of change in estimated body mass as flipper length increases as different in the three islands.
Your turn: Predict the body mass of a Dream island penguin with a flipper length of 200 mm.

predict(bm_fl_island_int_fit, new_data = penguin_200_Dream)

# A tibble: 1 × 1
  .pred
  <dbl>
1 3915.

Choosing a model

Rule of thumb: Occam’s Razor - Don’t overcomplicate the situation! We prefer the simplest best model.

glance(bm_fl_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.759         0.758  394.     1071. 4.37e-107     1 -2528. 5063. 5074.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

glance(bm_fl_poly_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.775         0.774  381.      585. 1.27e-110     2 -2516. 5041. 5056.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

glance(bm_fl_island_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.774         0.772  383.      386. 7.60e-109     3 -2517. 5045. 5064.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

glance(bm_fl_island_int_fit)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.786         0.783  374.      246. 4.55e-110     5 -2508. 5031. 5057.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Your turn: Which model do you believe is most appropriate? Why?

Could go with multiple choices. Personally I would choose the flipper length + island model, as it incorporates two variables that seem correlated with body mass, has a sufficiently high adjusted-\(R^2\), and is not overly complex (e.g. easier to interpret).

Session information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-03-23
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version    date (UTC) lib source
 backports        1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
 base64enc        0.1-3      2015-07-28 [1] CRAN (R 4.3.0)
 broom          * 1.0.5      2023-06-09 [1] CRAN (R 4.3.0)
 class            7.3-22     2023-05-03 [1] CRAN (R 4.3.2)
 cli              3.6.2      2023-12-11 [1] CRAN (R 4.3.1)
 codetools        0.2-19     2023-02-01 [1] CRAN (R 4.3.2)
 colorspace       2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
 data.table       1.14.10    2023-12-08 [1] CRAN (R 4.3.1)
 dials          * 1.2.0      2023-04-03 [1] CRAN (R 4.3.0)
 DiceDesign       1.10       2023-12-07 [1] CRAN (R 4.3.1)
 digest           0.6.34     2024-01-11 [1] CRAN (R 4.3.1)
 dplyr          * 1.1.4      2023-11-17 [1] CRAN (R 4.3.1)
 evaluate         0.23       2023-11-01 [1] CRAN (R 4.3.1)
 fansi            1.0.6      2023-12-08 [1] CRAN (R 4.3.1)
 farver           2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
 fastmap          1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 forcats        * 1.0.0      2023-01-29 [1] CRAN (R 4.3.0)
 foreach          1.5.2      2022-02-02 [1] CRAN (R 4.3.0)
 furrr            0.3.1      2022-08-15 [1] CRAN (R 4.3.0)
 future           1.33.1     2023-12-22 [1] CRAN (R 4.3.1)
 future.apply     1.11.1     2023-12-21 [1] CRAN (R 4.3.1)
 generics         0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2        * 3.4.4      2023-10-12 [1] CRAN (R 4.3.1)
 globals          0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
 glue             1.7.0      2024-01-09 [1] CRAN (R 4.3.1)
 gower            1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
 GPfit            1.0-8      2019-02-08 [1] CRAN (R 4.3.0)
 gtable           0.3.4      2023-08-21 [1] CRAN (R 4.3.0)
 hardhat          1.3.0      2023-03-30 [1] CRAN (R 4.3.0)
 here             1.0.1      2020-12-13 [1] CRAN (R 4.3.0)
 hms              1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
 htmltools        0.5.7      2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets      1.6.4      2023-12-06 [1] CRAN (R 4.3.1)
 infer          * 1.0.5      2023-09-06 [1] CRAN (R 4.3.0)
 ipred            0.9-14     2023-03-09 [1] CRAN (R 4.3.0)
 iterators        1.0.14     2022-02-05 [1] CRAN (R 4.3.0)
 jsonlite         1.8.8      2023-12-04 [1] CRAN (R 4.3.1)
 knitr            1.45       2023-10-30 [1] CRAN (R 4.3.1)
 labeling         0.4.3      2023-08-29 [1] CRAN (R 4.3.0)
 lattice          0.21-9     2023-10-01 [1] CRAN (R 4.3.2)
 lava             1.7.3      2023-11-04 [1] CRAN (R 4.3.1)
 lhs              1.1.6      2022-12-17 [1] CRAN (R 4.3.0)
 lifecycle        1.0.4      2023-11-07 [1] CRAN (R 4.3.1)
 listenv          0.9.0      2022-12-16 [1] CRAN (R 4.3.0)
 lubridate      * 1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
 magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 MASS             7.3-60     2023-05-04 [1] CRAN (R 4.3.2)
 Matrix           1.6-1.1    2023-09-18 [1] CRAN (R 4.3.2)
 mgcv             1.9-0      2023-07-11 [1] CRAN (R 4.3.2)
 modeldata      * 1.2.0      2023-08-09 [1] CRAN (R 4.3.0)
 munsell          0.5.0      2018-06-12 [1] CRAN (R 4.3.0)
 nlme             3.1-163    2023-08-09 [1] CRAN (R 4.3.2)
 nnet             7.3-19     2023-05-03 [1] CRAN (R 4.3.2)
 palmerpenguins * 0.1.1      2022-08-15 [1] CRAN (R 4.3.0)
 parallelly       1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
 parsnip        * 1.2.0      2024-02-16 [1] CRAN (R 4.3.1)
 pillar           1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 prodlim          2023.08.28 2023-08-28 [1] CRAN (R 4.3.0)
 purrr          * 1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
 R6               2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp             1.0.12     2024-01-09 [1] CRAN (R 4.3.1)
 readr          * 2.1.5      2024-01-10 [1] CRAN (R 4.3.1)
 recipes        * 1.0.9      2023-12-13 [1] CRAN (R 4.3.1)
 repr             1.1.6      2023-01-26 [1] CRAN (R 4.3.0)
 rlang            1.1.3      2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown        2.25       2023-09-18 [1] CRAN (R 4.3.1)
 rpart            4.1.21     2023-10-09 [1] CRAN (R 4.3.2)
 rprojroot        2.0.4      2023-11-05 [1] CRAN (R 4.3.1)
 rsample        * 1.2.0      2023-08-23 [1] CRAN (R 4.3.0)
 rstudioapi       0.15.0     2023-07-07 [1] CRAN (R 4.3.0)
 scales         * 1.2.1      2024-01-18 [1] Github (r-lib/scales@c8eb772)
 sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 skimr          * 2.1.5      2022-12-23 [1] CRAN (R 4.3.0)
 stringi          1.8.3      2023-12-11 [1] CRAN (R 4.3.1)
 stringr        * 1.5.1      2023-11-14 [1] CRAN (R 4.3.1)
 survival         3.5-7      2023-08-14 [1] CRAN (R 4.3.2)
 tibble         * 3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidymodels     * 1.1.1      2023-08-24 [1] CRAN (R 4.3.0)
 tidyr          * 1.3.0      2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect       1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse      * 2.0.0      2023-02-22 [1] CRAN (R 4.3.0)
 timechange       0.2.0      2023-01-11 [1] CRAN (R 4.3.0)
 timeDate         4032.109   2023-12-14 [1] CRAN (R 4.3.1)
 tune           * 1.1.2      2023-08-23 [1] CRAN (R 4.3.0)
 tzdb             0.4.0      2023-05-12 [1] CRAN (R 4.3.0)
 utf8             1.2.4      2023-10-22 [1] CRAN (R 4.3.1)
 vctrs            0.6.5      2023-12-01 [1] CRAN (R 4.3.1)
 withr            2.5.2      2023-10-30 [1] CRAN (R 4.3.1)
 workflows      * 1.1.4      2024-02-19 [1] CRAN (R 4.3.1)
 workflowsets   * 1.0.1      2023-04-06 [1] CRAN (R 4.3.0)
 xfun             0.41       2023-11-01 [1] CRAN (R 4.3.1)
 yaml             2.3.8      2023-12-11 [1] CRAN (R 4.3.1)
 yardstick      * 1.3.0      2024-01-19 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────