AE 02: Visualizing the prognosticators

Application exercise

Modified

March 6, 2024

Important

Go to the course GitHub organization and locate the repo titled ae-02-YOUR_GITHUB_USERNAME to get started.

This AE is due February 2 🎩 at 11:59pm.

For all analyses, we’ll use the tidyverse packages.

library(tidyverse)
library(scales)

Data: The prognosticators

The dataset we will visualize is called seers.¹ It contains summary statistics for all known Groundhog Day forecasters.² Let’s glimpse() at it.

# import data using readr::read_csv()
seers <- read_csv("data/prognosticators-sum-stats.csv")

Rows: 138 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): name, forecaster_type, forecaster_simple, climate_region, town, state
dbl (11): preds_n, preds_long_winter, preds_long_winter_pct, preds_correct, ...
lgl  (1): alive

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# add code here

The variables are:

name - name of the prognosticator
forecaster_type - what kind of animal or thing is the prognosticator?
forecaster_simple - a simplified version that lumps together the least-frequently appearing types of prognosticators
alive - is the prognosticator an animate (alive) being?³
climate_region - the NOAA climate region in which the prognosticator is located
town - town where the prognosticator is located
state - state (or territory) where the prognosticator is located
preds_n - number of predictions in the database
preds_long_winter - number of predictions for a “Late Winter” (as opposed to “Early Spring”)
preds_long_winter_pct - percentage of predictions for a “Late Winter”
preds_correct - number of correct predictions⁴
preds_rate - proportion of predictions that are correct
temp_mean - average temperature (in Fahrenheit) in February and March in the climate region across all prognostication years
temp_hist - average of the rolling 15-year historic average temperature in February and March across all prognostication years
temp_sd - standard deviation of average February and March temperatures across all prognostication years
precip_mean - average amount of precipitation in February and March across all prognostication years (measured in inches of rainfall)
precip_hist - average of the rolling 15-year historic average precipitation in February and March across all prognostication years
precip_sd - standard deviation of average February and March precipitation across all prognostication years

Visualizing prediction success rate - Demo

Single variable

Note

Analyzing a single variable is called univariate analysis.

Create visualizations of the distribution of preds_rate for the prognosticators.

Make a histogram. Set an appropriate binwidth.

# add code here
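
As a hint for choosing a binwidth, here is a minimal sketch on made-up data. The toy data frame and its rate column are hypothetical stand-ins, not the seers data, so you will still need to adapt the variable names. The key idea is that binwidth is expressed in the units of the x variable.

```r
library(ggplot2)

# Hypothetical toy data: 200 made-up success rates between 0 and 1
set.seed(123)
toy <- data.frame(rate = rbeta(200, 5, 5))

# binwidth is in the units of x, so 0.05 makes each bin
# five percentage points wide on a 0-1 proportion scale
p_hist <- ggplot(toy, aes(x = rate)) +
  geom_histogram(binwidth = 0.05, color = "white")

p_hist
```

Try a few values (say 0.02, 0.05, and 0.1) and keep the one that shows the overall shape without too much noise.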

Two variables - Your turn

Note

Analyzing the relationship between two variables is called bivariate analysis.

Create visualizations of the distribution of preds_rate by alive (whether or not the prognosticator is alive).

Make a single histogram. Set an appropriate binwidth.

# add code here

Use multiple histograms via faceting, one for each value of alive. Set an appropriate binwidth, add color as you see fit, and turn off legends if not needed.

# add code here
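
The faceting pattern can be sketched as follows, again on made-up data. The toy data frame and its rate and group columns are hypothetical placeholders. Because each panel is already labeled by the facet strip, the fill legend is redundant and is switched off in the geom.

```r
library(ggplot2)

# Hypothetical toy data: two made-up groups of 100 rates each
set.seed(123)
toy <- data.frame(
  rate  = c(rbeta(100, 5, 5), rbeta(100, 4, 6)),
  group = rep(c("a", "b"), each = 100)
)

# One histogram panel per group, stacked vertically so the
# distributions share an x-axis and are easy to compare
p_facet <- ggplot(toy, aes(x = rate, fill = group)) +
  geom_histogram(binwidth = 0.05, show.legend = FALSE) +
  facet_wrap(vars(group), ncol = 1)

p_facet
```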

Use side-by-side box plots. Add color as you see fit and turn off legends if not needed.

# add code here

Use a density plot. Add color as you see fit.

# add code here

Use a violin plot. Add color as you see fit and turn off legends if not needed.

# add code here

Make a jittered scatter plot. Add color as you see fit and turn off legends if not needed.

# add code here

Use beeswarm plots. Add color as you see fit and turn off legends if not needed.

library(ggbeeswarm)

# add code here
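
If you have not used ggbeeswarm before, the sketch below shows the basic shape of a beeswarm plot on made-up data. The toy data frame and its columns are hypothetical, and this assumes the ggbeeswarm package is installed. geom_beeswarm() places points like a jittered scatter plot, but nudges them sideways only as much as needed to avoid overlap.

```r
library(ggplot2)
library(ggbeeswarm)

# Hypothetical toy data: two made-up groups of 30 rates each
set.seed(123)
toy <- data.frame(
  rate  = c(rbeta(30, 5, 5), rbeta(30, 4, 6)),
  group = rep(c("a", "b"), each = 30)
)

# Categorical variable on x, numeric variable on y;
# the x-axis already names the groups, so the legend is dropped
p_bee <- ggplot(toy, aes(x = group, y = rate, color = group)) +
  geom_beeswarm(show.legend = FALSE)

p_bee
```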

Demonstration: Use multiple geoms on a single plot. Be deliberate about the order of plotting. Change the theme and the color scale of the plot. Finally, add informative labels.

# add code here
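
One possible way to layer geoms is sketched below on made-up data. The toy data frame, the labels, and the choice of violin plus jittered points are all illustrative assumptions, not the plot built in class. Layers are drawn in the order they are added, so the violin goes first and the points are drawn on top of it.

```r
library(ggplot2)

# Hypothetical toy data: two made-up groups of 50 rates each
set.seed(123)
toy <- data.frame(
  rate  = c(rbeta(50, 5, 5), rbeta(50, 4, 6)),
  group = rep(c("a", "b"), each = 50)
)

p_layers <- ggplot(toy, aes(x = group, y = rate, color = group)) +
  # First layer: violins form the background shape
  geom_violin(show.legend = FALSE) +
  # Second layer: individual points drawn over the violins
  geom_jitter(width = 0.1, alpha = 0.5, show.legend = FALSE) +
  # Swap the default color scale and theme
  scale_color_viridis_d(end = 0.8) +
  theme_minimal() +
  labs(
    title = "Made-up success rates by group",
    x = "Group",
    y = "Success rate"
  )

p_layers
```

Reversing the two geom lines would draw the violins over the points, which is worth trying once to see why order matters.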

Multiple variables - Demo

Note

Analyzing the relationship between three or more variables is called multivariate analysis.

Facet the plot you created in the previous exercise by forecaster_simple. Adjust labels accordingly.

# add code here

Before you continue, let’s turn off all warnings the code chunks generate and resize all figures. We’ll do this by editing the YAML.
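
For reference, a similar effect can be had from inside an R chunk with knitr's chunk defaults. This is an alternative technique, not the YAML edit this exercise asks for, and the figure dimensions below are arbitrary example values. It assumes the document is rendered with the knitr engine.

```r
# Alternative to the YAML route: set defaults for every later chunk.
# Place this in a chunk near the top of the document.
knitr::opts_chunk$set(
  warning   = FALSE,  # hide warnings generated by code chunks
  fig.width = 7,      # figure width in inches (example value)
  fig.asp   = 0.618   # height = width * 0.618 (example value)
)
```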

Visualizing other variables - Your turn!

Pick a single categorical variable from the data set and make a bar plot of its distribution.

# add code here

Pick two categorical variables and create a plot that shows the relationship between them. Along with your code and output, provide an interpretation of the visualization.

# add code here

Interpretation goes here…

Make another plot that uses at least three variables. At least one should be numeric and at least one categorical. In 1-2 sentences, describe what the plot shows about the relationships between the variables you plotted. Don’t forget to label your code chunk.

# add code here

Interpretation goes here…

Footnotes

¹ I would prefer prognosticators, but I had way too many typos preparing these materials to make you all use it.
² Source: Countdown to Groundhog Day. Application exercise inspired by Groundhogs Do Not Make Good Meteorologists originally published on FiveThirtyEight.
³ Prognosticators labeled as Animatronic/Puppet/Statue/Stuffed/Taxidermied are classified as not alive.
⁴ We adopt the same definition as FiveThirtyEight. An “Early Spring” is defined as any year in which the average temperature in either February or March was higher than the historic average. A “Late Winter” was when the average temperature in both months was lower than or the same as the historic average.