Lab 01 - Data visualization

Lab

Modified

March 6, 2024

Important

This lab is due February 5 at 11:59pm ET.

Learning goals

In this lab, you will…

- learn how to effectively visualize numeric and categorical data.
- continue developing a workflow for reproducible data analysis.

Getting started

Go to the info2950-sp24 organization on GitHub. Click on the repo with the prefix lab-01. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

If you are completing the assignment with a peer

You may complete a lab assignment collaboratively with up to three peers in the class (maximum group size of four). If you choose to do so, please submit the assignment only once in Gradescope. You should also revise the YAML header (the author argument only) in your Quarto document so you can list all the students who worked on the assignment.

author: 
  - "Student 1 (netID)"
  - "Student 2 (netID)"
  - "Student 3 (netID)"
  - "Student 4 (netID)"

Packages

We will use the tidyverse package to create and customize plots in R, viridis for custom color palettes, and scales to format axis labels.

library(tidyverse)
library(viridis)
library(scales)

Data: Exploring higher education statistics

We will use data from College Scorecard.¹ The subset we will analyze contains a small number of metrics for all four-year colleges and universities in the United States for the 2021-22 academic year.²

The data is stored in scorecard.csv. To import the data, use the read_csv() function.

scorecard <- read_csv("data/scorecard.csv")
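After importing, it is worth confirming that the columns were read with the names and types you expect. The following is a minimal sketch, not part of the starter code: it writes a small hypothetical CSV to a temporary file so the example runs on its own. The file contents and the name `toy` are invented for illustration and do not come from scorecard.csv.

```r
library(tidyverse)

# Hypothetical stand-in for data/scorecard.csv; the values are made up.
toy_path <- tempfile(fileext = ".csv")
write_lines(
  c("name,type,net_cost", "College A,Public,15000", "College B,Private,NA"),
  toy_path
)

toy <- read_csv(toy_path, show_col_types = FALSE)

glimpse(toy)         # column names, types, and the first few values
colSums(is.na(toy))  # number of missing values in each column
```

The same two checks can be run on `scorecard` once it has been imported.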

Exercises

As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled with appropriate units indicated (e.g. %, $), and careful consideration should be given to aesthetic choices.

In addition, no line of code should exceed 80 characters, so that all the code can be read when you render to PDF. To help with this, you can add a vertical line at 80 characters by clicking “Tools” $\rightarrow$ “Global Options” $\rightarrow$ “Code” $\rightarrow$ “Display”, then setting “Margin Column” to 80 and clicking “Apply”.
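If you want a quick check in addition to the margin line, one possible approach is to count characters per line in base R. This is only a sketch under stated assumptions: the temporary file and its contents are hypothetical, and the check flags every long line in the file, prose as well as code, so treat its output as a list of lines to inspect.

```r
# Hypothetical file; in practice, point readLines() at your own .qmd file.
toy_qmd <- tempfile(fileext = ".qmd")
writeLines(c("short line", strrep("x", 95)), toy_qmd)

qmd_lines <- readLines(toy_qmd)

# Line numbers of every line longer than 80 characters
too_long <- which(nchar(qmd_lines) > 80)
too_long
```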

Continuing to develop a sound workflow for reproducible data analysis is important as you complete this lab and other assignments in the course. Periodic reminders in this assignment will prompt you to render, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.

Make a histogram to visualize the net cost of attendance. Set the binwidth to 1,000 and include axis labels and a title.
- Describe the shape of the distribution.
- Do there appear to be any outliers? Briefly explain.

Note

For more details and code examples for histograms, use the ggplot2 reference page.
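As an illustration of the histogram pieces involved (a fixed binwidth, dollar-formatted axis labels, and a title), here is a minimal sketch on the `diamonds` data set that ships with ggplot2. It uses `diamonds` and the object name `p_hist` only so the example is self-contained; it is not the solution for the scorecard data.

```r
library(tidyverse)
library(scales)

# diamonds ships with ggplot2; price is recorded in US dollars
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000) +            # each bar spans $1,000
  scale_x_continuous(labels = label_dollar()) + # show units on the axis
  labs(
    title = "Distribution of diamond prices",
    x = "Price",
    y = "Number of diamonds"
  )

p_hist
```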

Create a scatterplot of the net cost of attendance (net_cost) versus average faculty salary (avg_fac_sal) with points colored by type of university (i.e. public, private non-profit, or private for-profit). Label the axes and legend and give the plot a title. Use the scale_color_viridis_d() function to apply the viridis color palette to your plot.

Note

See Introduction to the viridis color maps to read more about the viridis R package and see code examples.
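To show how a discrete viridis palette and a legend title fit together, the sketch below uses the `mpg` data set from ggplot2 rather than the scorecard data. The variables and the object name `p_scatter` are chosen only for illustration.

```r
library(tidyverse)
library(viridis)

# mpg ships with ggplot2; drv is a categorical variable with three levels
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_viridis_d() +  # discrete viridis palette for the color aesthetic
  labs(
    title = "Highway mileage versus engine displacement",
    x = "Engine displacement (liters)",
    y = "Highway mileage (miles per gallon)",
    color = "Drive train"    # legend title
  )

p_scatter
```

Setting `color` inside `labs()` relabels the legend in the same call that labels the axes.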

Render, commit, and push your changes to GitHub with the commit message “Added answer for Ex 1-2”.

Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Describe what you observe in the plot from the previous exercise. In your description, include similarities and differences in the patterns across college types. Is this an effective graph to assess these patterns? Why or why not?
Now, let’s examine the relationship between the same two variables, using a separate plot for each type. Label the axes and give the plot a title. Use geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data. Which plot do you prefer: this plot or the plot in Ex 2? Briefly explain your choice.

Note

se = FALSE removes the confidence bands around the line. These bands show the uncertainty around the smooth curve. We’ll discuss uncertainty around estimates later in the course and bring these bands back then.
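One common way to get a separate panel per group is faceting. The sketch below assumes `facet_wrap()` is an acceptable way to split the plot and uses the `mpg` data set from ggplot2 for illustration; the object name `p_facet` is arbitrary.

```r
library(tidyverse)

p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE) +  # smooth curve without the confidence band
  facet_wrap(vars(drv)) +    # one panel per level of drv
  labs(
    title = "Highway mileage versus engine displacement",
    subtitle = "By drive train",
    x = "Engine displacement (liters)",
    y = "Highway mileage (miles per gallon)"
  )

p_facet
```

Because each panel shares the same axes by default, patterns can be compared across panels without points from different groups overlapping.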

Now is another good time to render, commit, and push your changes to GitHub with a meaningful commit message.

Once again, make sure to commit and push all changed files so that your Git pane is empty afterwards.

Do students leaving some types of colleges tend to have higher debt loads than others? To explore this question, create side-by-side boxplots of median student debt (debt) of a college based on type (type).
- Describe what you observe from the plot.
- Which type of college has the single highest median student debt? How do you know based on the plot?
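For reference, side-by-side boxplots need a categorical variable on one axis and a numeric variable on the other. This is a minimal sketch on the `mpg` data set from ggplot2, with a grouped summary as one possible way to cross-check what the plot shows; `p_box` and `max_by_group` are names invented for this example.

```r
library(tidyverse)

# One box per level of drv, placed side by side on the x axis
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway mileage by drive train",
    x = "Drive train",
    y = "Highway mileage (miles per gallon)"
  )

p_box

# Numeric cross-check: the largest single value within each group
max_by_group <- mpg |>
  group_by(drv) |>
  summarize(max_hwy = max(hwy))

max_by_group
```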
Are some types of colleges more likely to be located in a city? Create a segmented bar chart with one bar per type of college and the fill determined by the distribution of locale, which identifies if the college is located in a city, suburb, town, or rural area. The $y$ axis of the segmented bar chart should range from 0 to 1.³
- What do you notice from the plot?

Note

For this exercise, you should begin with the data wrangling code below. We will learn more about data wrangling next week.

scorecard <- scorecard |>
  drop_na(locale)
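A segmented bar chart whose bars all reach 1 can be produced by stacking counts and rescaling each bar to proportions. The sketch below shows that idea with `position = "fill"` on the `mpg` data set from ggplot2; the variables and the name `p_bar` are for illustration only.

```r
library(tidyverse)

p_bar <- ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(position = "fill") +  # rescale each bar so its segments sum to 1
  scale_fill_viridis_d() +
  labs(
    title = "Vehicle class by drive train",
    x = "Drive train",
    y = "Proportion",
    fill = "Vehicle class"
  )

p_bar
```

To express the $y$ axis in percentages instead, one option is to add `scale_y_continuous(labels = label_percent())` from the scales package.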

Now is another good time to render, commit, and push your changes to GitHub with a meaningful commit message.

And once again, make sure to commit and push all changed files so that your Git pane is empty afterwards. We keep repeating this because it’s important, and because we see students forget to do this. So take a moment to make sure you’re following along with the instructions around Git.

Recreate the plot below.

Hints:
- The ggplot2 reference for themes will be helpful in determining the theme.
- The size of the points is 2.
- The transparency (alpha) of the points is 0.5.
- The scales reference for axis labels has several functions for formatting ggplot2 scales.
- You can put line breaks in labels with \n.
- You do not have to perfectly match the aspect ratio.
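The hints above can be combined as in the following sketch, which uses the first 500 rows of the `diamonds` data set from ggplot2. The choice of `theme_minimal()` is arbitrary and is not necessarily the theme used in the target plot; `p_styled` is a name invented for this example.

```r
library(tidyverse)
library(scales)

p_styled <- ggplot(slice_head(diamonds, n = 500), aes(x = carat, y = price)) +
  geom_point(size = 2, alpha = 0.5) +            # point size and transparency
  scale_y_continuous(labels = label_dollar()) +  # dollar-formatted axis labels
  labs(
    title = "Diamond price versus weight\nFirst 500 diamonds",  # \n breaks line
    x = "Weight (carats)",
    y = "Price"
  ) +
  theme_minimal()                                # arbitrary theme choice

p_styled
```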

Render, commit, and push your final changes to GitHub with a meaningful commit message.

Make sure to commit and push all changed files so that your Git pane is empty afterwards.

Submission

Once you are finished with the lab, you will submit your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials $\rightarrow$ Cornell University NetID and log in using your NetID credentials.
Click on your INFO 2950 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

If you worked with other students on the assignment

Follow the Gradescope instructions to include all students on the submission.

Grading (50 pts)

Component	Points
Ex 1	4
Ex 2	6
Ex 3	4
Ex 4	8
Ex 5	6
Ex 6	6
Ex 7	8
Workflow & formatting	8

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

Following tidyverse code style
All code being visible in the rendered PDF (no line longer than 80 characters)
Appropriate figure sizing, and figures with informative labels and legends

Footnotes

¹ College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.
² The full database contains thousands of variables from 1996-2022.
³ Or from 0 to 100 if you express the $y$ axis in percentages.