Grammar of data wrangling

Lecture 5

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Summer 2024

June 5, 2024



  • Suggested solutions posted for ae-01 and ae-02
  • Homework 01 due tonight by 11:59pm
  • Ask course questions on GitHub Discussions

Clarification for hw-01

Home age variable

tompkins <- tompkins |>
  mutate(home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960"))
Create a new column of data
Save the modified data frame as tompkins
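
A minimal sketch of the same pattern on a toy data frame, to show how the boundary year is handled. The `homes` data frame and its values are hypothetical stand-ins, not the actual hw-01 data.

```r
library(dplyr)

# Hypothetical stand-in for the tompkins data; values are made up
homes <- data.frame(
  year_built = c(1925, 1959, 1960, 2005)
)

# Create home_age and save the result back to the same object
homes <- homes |>
  mutate(
    home_age = if_else(year_built < 1960, "Before 1960", "Newer than 1960")
  )

# A home built in exactly 1960 fails `year_built < 1960`,
# so it is labeled "Newer than 1960"
homes$home_age
```

Without the assignment (`homes <- ...`), the new column would be printed but not stored.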

Ordering of health categories

brfss <- brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good",
                                 "Good", "Fair", "Poor")
  )
Modify a column of data
Ensure specific ordering of general_health when plotted
Save the modified data frame as brfss
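
To see why the releveling step matters, here is a small sketch on made-up data. The `survey` data frame is a hypothetical stand-in for `brfss`; only the five category labels come from the slide.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for the brfss data
survey <- data.frame(
  general_health = c("Good", "Poor", "Excellent", "Fair", "Very good")
)

# as.factor() on its own orders the levels alphabetically
default_levels <- levels(as.factor(survey$general_health))

# Convert to a factor, then put the levels in their natural order
survey <- survey |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(
      general_health,
      "Excellent", "Very good", "Good", "Fair", "Poor"
    )
  )

levels(survey$general_health)
```

ggplot2 draws factor categories in level order, so the releveled column should plot from "Excellent" through "Poor" rather than alphabetically.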

Coding style + workflow

  • Avoid long lines of code

    • We should be able to see all of your code in the PDF document you submit.
  • Use the tidyverse style guide and styler

  • Render, commit, and push regularly

    • Think about it like clicking to save regularly as you type a report

Why data wrangling matters

Application exercise

College education

A Cornell student wearing a red coat walking across the Arts Quad in the snow.


  • Go to the course GitHub org and find your ae-03 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of the day.

Recap of AE

  • The pipe operator, |>, can be read as “and then”.
  • The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
sum(1, 2)
[1] 3
1 |>
  sum(2)
[1] 3
  • Always use a line break after the pipe, and indent the next line of code.
    • This mirrors ggplot2 style: always use a line break after + between layers, and indent the next line.
  • Use dplyr functions to transform your data
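
The recap points above can be sketched with a short dplyr pipeline. The `colleges` data frame below is hypothetical and made up for illustration; it is not the AE data set.

```r
library(dplyr)

# Hypothetical data, invented for this example
colleges <- data.frame(
  name = c("A", "B", "C", "D"),
  type = c("Public", "Private", "Public", "Private"),
  cost = c(20000, 55000, 25000, 60000)
)

# Nested calls: must be read from the inside out
summary_nested <- summarize(group_by(colleges, type), mean_cost = mean(cost))

# Piped: take colleges, and then group by type, and then summarize
summary_piped <- colleges |>
  group_by(type) |>
  summarize(mean_cost = mean(cost))

summary_piped
```

Both versions produce the same result, because each pipe passes its left-hand side as the first argument of the next function. The piped version reads top to bottom in the order the steps happen.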