Importing and recoding data

Lecture 9

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

February 20, 2024

Announcements

  • Continue making progress on your projects
  • Lab 03 on Friday

Is \(x\) in \(y\)?

major_income_undergrad <- read_csv(file = "data/major_income_undergrad.csv")

stem_categories <- c(
  "Biology & Life Science",
  "Computers & Mathematics",
  "Engineering",
  "Physical Sciences"
)

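# Incorrect: `==` compares element by element and recycles stem_categories,
# so a row is labeled "STEM" only when its category happens to line up
# with the recycled value in the same position
major_income_undergrad <- major_income_undergrad |>
  mutate(major_type = if_else(major_category == stem_categories,
    "STEM", "Not STEM"
  ))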

major_income_undergrad |>
  filter(major_type == "STEM", undergrad_median < 55000) |>
  select(major, major_type, undergrad_median) |>
  arrange(desc(undergrad_median))
# A tibble: 1 × 3
  major   major_type undergrad_median
  <chr>   <chr>                 <dbl>
1 Biology STEM                  54000

# Correct: `%in%` checks each major_category against every value
# in stem_categories
major_income_undergrad <- major_income_undergrad |>
  mutate(major_type = if_else(major_category %in% stem_categories,
    "STEM", "Not STEM"
  ))

major_income_undergrad |>
  filter(major_type == "STEM", undergrad_median < 55000) |>
  select(major, major_type, undergrad_median) |>
  arrange(desc(undergrad_median))
# A tibble: 7 × 3
  major                      major_type undergrad_median
  <chr>                      <chr>                 <dbl>
1 Biology                    STEM                  54000
2 Communication Technologies STEM                  52000
3 Botany                     STEM                  50000
4 Physiology                 STEM                  50000
5 Molecular Biology          STEM                  50000
6 Ecology                    STEM                  48700
7 Neuroscience               STEM                  40000
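
The difference between the two results comes down to how each operator compares vectors. A minimal sketch with a made-up `categories` vector (hypothetical values, not taken from the data file) shows the recycling behavior of `==` next to the set-membership test of `%in%`:

```r
stem_categories <- c(
  "Biology & Life Science",
  "Computers & Mathematics",
  "Engineering",
  "Physical Sciences"
)

# Hypothetical category column with eight rows (a multiple of four,
# so the recycling in `==` happens silently, without a warning)
categories <- c(
  "Engineering", "Engineering", "Arts", "Physical Sciences",
  "Biology & Life Science", "Engineering", "Engineering", "Business"
)

# `==` lines up position 1 with position 1, 2 with 2, and so on,
# reusing stem_categories from the start once it runs out
eq_match <- categories == stem_categories

# `%in%` checks each category against the whole set
in_match <- categories %in% stem_categories

sum(eq_match) # 3: only the rows that line up by position
sum(in_match) # 6: every row whose category is in the set
```

When the longer vector's length is not a multiple of the shorter one, R adds a warning about object lengths, but it still returns the recycled comparison.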

Data “wrangling”

A screenshot of a New York Times article.

A screenshot of 'Data Carpentry' by David Mimno.

Reading data into R

Reading rectangular data

  • readr:
    • Most commonly: read_csv()
    • Maybe also: read_tsv(), read_delim(), etc.
  • readxl: read_excel()
  • arrow: read_feather(), read_parquet()
  • haven: read_sas(), read_sav(), read_dta()
  • googlesheets4: read_sheet()
  • data.table: fread()
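
As a rough illustration of the most common case, here is a sketch of `read_csv()` with explicit missing-value strings and column types. The column names and values are invented for the example, and the data is passed inline with `I()` (available in readr 2.0.0 and later) so that no file on disk is needed:

```r
library(readr)

# Hypothetical file contents, supplied inline so the example is self-contained
raw_text <- I("major,median_income\nBiology,54000\nBotany,N/A\n")

majors <- read_csv(
  raw_text,
  na = c("", "NA", "N/A"), # strings to treat as missing
  col_types = cols(
    major = col_character(),
    median_income = col_double()
  )
)

majors
```

Declaring `col_types` up front is optional, but it makes the import fail loudly when a column does not contain what you expect.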

Application exercise

Powerball Lottery

ae-07

  • Go to the course GitHub org and find your ae-07 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Recap of AE

  • Simplify your life – get the data in as simple a format as possible
  • Examine the file’s structure before attempting to import into R. Use the RStudio interactive menu as necessary.
  • Ensure all data cleaning is reproducible. Do not replace your raw data files.
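
One way these three points might look in practice is sketched below. The file names, the one-line preamble, and the use of -99 as a missing-value code are all assumptions made up for the example; the raw file is created in a temporary directory so the code runs on its own:

```r
library(readr)
library(dplyr)

# Set up a hypothetical raw file with a one-line preamble above the header
raw_path <- file.path(tempdir(), "raw_scores.csv")
write_lines(
  c("Exported from survey tool", "Name,Score", "ann,90", "bob,-99"),
  raw_path
)

# 1. Examine the file's structure before importing
read_lines(raw_path, n_max = 3)

# 2. Import, skipping the preamble line found above
scores_raw <- read_csv(raw_path, skip = 1, show_col_types = FALSE)

# 3. Clean in code, so every step can be rerun from the raw file
scores_clean <- scores_raw |>
  rename(name = Name, score = Score) |>
  mutate(score = na_if(score, -99)) # assumption: -99 codes a missing value

# 4. Save the cleaned data to a new file; the raw file is left untouched
clean_path <- file.path(tempdir(), "clean_scores.csv")
write_csv(scores_clean, clean_path)
```

Because the cleaned data is written to a separate file, rerunning the script from the raw file always reproduces the same result.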

Ariel the cat