Working with multiple data frames

Lecture 6

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

February 8, 2024

Announcements

Announcements

  • Team assignments start tomorrow
    • Team project
    • Labs
  • Gradescope submission tips
    • Remember to submit on Gradescope
    • Remember to tag all pages for appropriate content
    • Regrade requests for lab-01 and hw-01

Relevant questions at this point

  • How important is understanding the readings?
    • Be familiar with the techniques, but you don’t have to master it all before class
  • Should we use viridis for all visualizations going forward, even if not explicitly requested?
    • No, and yes – it doesn’t have to be viridis but I strongly recommend choosing an optimal color palette
  • What should my code chunk labels look like?
    • Look at your application exercises

fill vs. color

  • Use fill when a geom is filled
  • Use color when a geom is outlined

Relational joins

Introduction to relational data

  • Multiple tables of data that when combined together accomplish goals
    • Flat tabular files
    • Database (e.g. SQL, NoSQL, etc.)
  • Relations define the important element, not just the individual tables
  • Relations are defined between a pair of tables
  • Relational verbs
    • Mutating joins
    • Filtering joins

Comic book characters

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics

Publishers

Publisher Year founded
DC 1934
Marvel 1939
Image 1992

Mutating joins

inner_join()

inner_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
inner_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 5
  Name    Alignment Gender Publisher `Year founded`
  <chr>   <chr>     <chr>  <chr>              <dbl>
1 Magneto Bad       Male   Marvel              1939
2 Batman  Good      Male   DC                  1934

left_join()

left_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
left_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name    Alignment Gender Publisher     `Year founded`
  <chr>   <chr>     <chr>  <chr>                  <dbl>
1 Magneto Bad       Male   Marvel                  1939
2 Batman  Good      Male   DC                      1934
3 Sabrina Good      Female Archie Comics             NA

right_join()

right_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
right_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 3 × 5
  Name    Alignment Gender Publisher `Year founded`
  <chr>   <chr>     <chr>  <chr>              <dbl>
1 Magneto Bad       Male   Marvel              1939
2 Batman  Good      Male   DC                  1934
3 <NA>    <NA>      <NA>   Image               1992

right_join() reversed

full_join()

full_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
full_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 4 × 5
  Name    Alignment Gender Publisher     `Year founded`
  <chr>   <chr>     <chr>  <chr>                  <dbl>
1 Magneto Bad       Male   Marvel                  1939
2 Batman  Good      Male   DC                      1934
3 Sabrina Good      Female Archie Comics             NA
4 <NA>    <NA>      <NA>   Image                   1992

Filtering joins

semi_join()

semi_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
semi_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 2 × 4
  Name    Alignment Gender Publisher
  <chr>   <chr>     <chr>  <chr>    
1 Magneto Bad       Male   Marvel   
2 Batman  Good      Male   DC       

anti_join()

anti_join()

Name Alignment Gender Publisher
Magneto Bad Male Marvel
Batman Good Male DC
Sabrina Good Female Archie Comics
Publisher Year founded
DC 1934
Marvel 1939
Image 1992
anti_join(x = superheroes, y = publishers, by = join_by(Publisher))
# A tibble: 1 × 4
  Name    Alignment Gender Publisher    
  <chr>   <chr>     <chr>  <chr>        
1 Sabrina Good      Female Archie Comics

Application exercise

Goal

Join data from multiple data frames, summarize it, and create this plot.

ae-04

  • Go to the course GitHub org and find your ae-04 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow.

Recap of AE

  • Use the *_join() function appropriate for your analysis
  • Leverage the difference between mutating and filtering joins

TidyTuesday

A logo for Tidy Tuesday. Describes Tidy Tuesday as a weekly data project in R from the R for Data Science online learning community.