AE 04: Joining prognosticators

Prognosticator success

We previously examined the accuracy rate of Groundhog Day prognosticators.¹ Today we want to work with the original dataset to understand how those accuracy metrics were generated and answer the question: How does prognosticator accuracy vary by climatic region?

Let’s start by looking at the seers data frame.

glimpse(seers)

Rows: 1,573
Columns: 7
$ name            <chr> "Punxsutawney Phil", "Punxsutawney Phil", "Punxsutawne…
$ forecaster_type <chr> "Groundhog", "Groundhog", "Groundhog", "Groundhog", "G…
$ alive           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ town            <chr> "Punxsutawney", "Punxsutawney", "Punxsutawney", "Punxs…
$ state           <chr> "PA", "PA", "PA", "PA", "PA", "PA", "PA", "PA", "PA", …
$ year            <dbl> 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, …
$ prediction      <chr> "Late Winter", "Late Winter", "Late Winter", "Early Sp…

We have the predictions, but our goal is to make a visualization by climate region.²

The nine regions as defined by the National Climatic Data Center and regularly used in climate summaries.

Join the data frames

Let’s take a look at the weather data frame.

glimpse(weather)

Rows: 5,520
Columns: 13
$ region         <chr> "Northeast", "Northeast", "Northeast", "Northeast", "No…
$ state_abb      <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "…
$ id             <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, …
$ year           <dbl> 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1…
$ avg_temp       <dbl> 28.00, 29.20, 24.90, 23.15, 28.05, 22.05, 27.50, 21.55,…
$ temp_hist      <dbl> 25.58333, 26.09000, 26.16667, 25.85667, 25.63333, 25.52…
$ temp_hist_sd   <dbl> 4.245360, 4.241218, 4.103158, 4.124311, 3.907804, 4.016…
$ temp_sd        <dbl> 4.120782, 4.120782, 4.120782, 4.120782, 4.120782, 4.120…
$ precip         <dbl> 4.005, 2.520, 2.810, 3.570, 3.765, 2.920, 2.330, 3.425,…
$ precip_hist    <dbl> 3.476667, 3.526667, 3.378000, 3.411000, 3.446333, 3.352…
$ precip_hist_sd <dbl> 1.1784719, 1.2081292, 1.1442431, 1.1620681, 1.2039309, …
$ precip_sd      <dbl> 0.9620579, 0.9620579, 0.9620579, 0.9620579, 0.9620579, …
$ outcome        <chr> "Early Spring", "Early Spring", "Early Spring", "Late W…

Your turn (2 minutes):
- Which variable(s) will we use to join the seers and weather data frames?
- We want to keep all rows and columns from seers and add columns for corresponding weather data. Which join function should we use?
Demo: Join the two data frames and assign the joined data frame to seers_weather.

seers_weather <- inner_join(
  x = seers, y = weather,
  by = join_by(state == state_abb, year)
)

Calculate the variables

Demo: Take a look at the updated seers data frame. First we need to calculate for each prediction whether or not the prognostication was correct.

seers_weather <- seers_weather |>
  mutate(correct_pred = prediction == outcome)

Demo: Calculate the accuracy rate (we’ll call it preds_rate) for weather predictions using the summarize() function in dplyr. Note that the function for calculating the mean is mean() in R.

seers_weather |> # start with seers data frame
  group_by(region) |> # group by region
  summarize(preds_rate = mean(correct_pred)) # calculate accuracy rate

# A tibble: 9 × 2
  region                      preds_rate
  <chr>                            <dbl>
1 Northeast                        0.475
2 Northern Rockies and Plains      0.511
3 Northwest                        0.44 
4 Ohio Valley                      0.544
5 South                            0.5  
6 Southeast                        0.561
7 Southwest                        0.636
8 Upper Midwest                    0.489
9 West                             0.333

Your turn (5 minutes): Now expand your calculations to also calculate the number of predictions in each region and the standard error of accuracy rate. Store this data frame as seers_summary. Recall the formula for the standard error of a sample proportion:

\[SE(\hat{p}) \approx \sqrt{\frac{(\hat{p})(1 - \hat{p})}{n}}\]

seers_summary <- seers_weather |>
  group_by(region) |>
  summarize(
    preds_rate = mean(correct_pred),
    preds_n = n(),
    preds_se = sqrt((preds_rate * (1 - preds_rate)) / preds_n)
  )
seers_summary

# A tibble: 9 × 4
  region                      preds_rate preds_n preds_se
  <chr>                            <dbl>   <int>    <dbl>
1 Northeast                        0.475     629   0.0199
2 Northern Rockies and Plains      0.511      45   0.0745
3 Northwest                        0.44       50   0.0702
4 Ohio Valley                      0.544     259   0.0309
5 South                            0.5        76   0.0574
6 Southeast                        0.561     164   0.0388
7 Southwest                        0.636      33   0.0837
8 Upper Midwest                    0.489      92   0.0521
9 West                             0.333       6   0.192

Demo: Take the seers_summary data frame and order the results in descending order of accuracy rate.

seers_summary |> # start with seers_summary data frame
  arrange(desc(preds_rate)) # order in descending order of preds_rate

# A tibble: 9 × 4
  region                      preds_rate preds_n preds_se
  <chr>                            <dbl>   <int>    <dbl>
1 Southwest                        0.636      33   0.0837
2 Southeast                        0.561     164   0.0388
3 Ohio Valley                      0.544     259   0.0309
4 Northern Rockies and Plains      0.511      45   0.0745
5 South                            0.5        76   0.0574
6 Upper Midwest                    0.489      92   0.0521
7 Northeast                        0.475     629   0.0199
8 Northwest                        0.44       50   0.0702
9 West                             0.333       6   0.192

Recreate the plot

Demo: Recreate the following plot using the data frame you have developed so far.

seers_summary |>
  mutate(region = fct_reorder(.f = region, .x = preds_rate)) |>
  ggplot(mapping = aes(x = preds_rate, y = region)) +
  geom_point(mapping = aes(size = preds_n)) +
  geom_linerange(mapping = aes(
    xmin = preds_rate - preds_se,
    xmax = preds_rate + preds_se
  )) +
  scale_x_continuous(labels = label_percent()) +
  labs(
    title = "Prognosticator accuracy rate for late winter/early spring",
    subtitle = "By climate region",
    x = "Prediction accuracy",
    y = NULL,
    size = "Total number\nof predictions",
    caption = "Source: Countdown to Groundhog Day & NOAA"
  ) +
  theme_minimal()

Your turn (time permitting): Make any other changes you would like to improve it.

# add your code here

Session information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-02-10
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
 class         7.3-22  2023-05-03 [1] CRAN (R 4.3.2)
 classInt      0.4-10  2023-09-05 [1] CRAN (R 4.3.0)
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
 codetools     0.2-19  2023-02-01 [1] CRAN (R 4.3.2)
 colorspace  * 2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 curl          5.2.0   2023-12-08 [1] CRAN (R 4.3.1)
 DBI           1.2.0   2023-12-21 [1] CRAN (R 4.3.1)
 digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 e1071         1.7-14  2023-12-06 [1] CRAN (R 4.3.1)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 ggthemes      5.0.0   2023-11-21 [1] CRAN (R 4.3.1)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 httr          1.4.7   2023-08-15 [1] CRAN (R 4.3.0)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 KernSmooth    2.23-22 2023-07-10 [1] CRAN (R 4.3.2)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
 labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 proxy         0.4-27  2022-06-09 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 ragg          1.2.7   2023-12-11 [1] CRAN (R 4.3.1)
 rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.3.0)
 Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
 readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 s2            1.1.6   2023-12-19 [1] CRAN (R 4.3.1)
 scales      * 1.2.1   2024-01-18 [1] Github (r-lib/scales@c8eb772)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 sf          * 1.0-15  2023-12-18 [1] CRAN (R 4.3.1)
 showtext    * 0.9-6   2023-05-03 [1] CRAN (R 4.3.0)
 showtextdb  * 3.0     2020-06-04 [1] CRAN (R 4.3.0)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.1)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 sysfonts    * 0.8.8   2022-03-13 [1] CRAN (R 4.3.0)
 systemfonts   1.0.5   2023-10-09 [1] CRAN (R 4.3.1)
 textshaping   0.3.7   2023-10-09 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 tigris      * 2.0.4   2023-09-22 [1] CRAN (R 4.3.1)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 units         0.8-5   2023-11-28 [1] CRAN (R 4.3.1)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 uuid          1.1-1   2023-08-17 [1] CRAN (R 4.3.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 vroom         1.6.5   2023-12-05 [1] CRAN (R 4.3.1)
 withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.1)
 wk            0.9.1   2023-11-29 [1] CRAN (R 4.3.1)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.3.1)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

See ae-02 ↩︎
Source: National Weather Service Climate Prediction Center ↩︎