library(tidyverse)
AE 03: Wrangling college education metrics
Go to the course GitHub organization and locate the repo titled ae-03-YOUR_GITHUB_USERNAME
to get started.
This AE is due February 7 at 11:59pm.
To demonstrate data wrangling we will use data from College Scorecard.1 The subset we will analyze contains a small number of metrics for all four-year colleges and universities in the United States for the 2021-22 academic year.2
The data is stored in scorecard.csv
. The variables are:
unit_id
- Unit ID for institutionname
- Name of the collegestate
- State abbreviationtype
- Type of college (Public; Private, nonprofit; Private, for-profit)adm_rate
- Undergraduate admissions rate (from 0-100%)sat_avg
- Average SAT equivalent score of students admittedcost
- The average annual total cost of attendance, including tuition and fees, books and supplies, and living expensesnet_cost
- The average annual net cost of attendance (annual cost of attendance minus the average grant/scholarship aid)avg_fac_sal
- Average faculty salary (9 month)pct_pell
- Percentage of undergraduates who receive a Pell Grantcomp_rate
- Rate of first-time, full-time students at four-year institutions who complete their degree within six yearsfirst_gen
- Share of first-generation studentsdebt
- Median debt of students after leaving schoollocale
- Locale of institution
<- read_csv("data/scorecard.csv") scorecard
The data frame has over 1700 observations (rows), 1719 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.
glimpse(scorecard)
Rows: 1,719
Columns: 14
$ unit_id <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 10…
$ name <chr> "Alabama A & M University", "University of Alabama at Birm…
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ type <chr> "Public", "Public", "Public", "Public", "Public", "Public"…
$ adm_rate <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.…
$ sat_avg <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111…
$ cost <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 36…
$ net_cost <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 20…
$ avg_fac_sal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 5…
$ pct_pell <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.…
$ comp_rate <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.…
$ first_gen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.3…
$ debt <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 15…
$ locale <chr> "City", "City", "City", "City", "City", "City", "City", "C…
names(scorecard)
[1] "unit_id" "name" "state" "type" "adm_rate"
[6] "sat_avg" "cost" "net_cost" "avg_fac_sal" "pct_pell"
[11] "comp_rate" "first_gen" "debt" "locale"
head(scorecard)
# A tibble: 6 × 14
unit_id name state type adm_rate sat_avg cost net_cost avg_fac_sal pct_pell
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100654 Alab… AL Publ… 0.716 954 21924 13057 79011 0.685
2 100663 Univ… AL Publ… 0.885 1266 26248 16585 104310 0.325
3 100706 Univ… AL Publ… 0.737 1300 24869 17250 88380 0.238
4 100724 Alab… AL Publ… 0.980 955 21938 13593 69309 0.720
5 100751 The … AL Publ… 0.789 1244 31050 21534 94581 0.171
6 100830 Aubu… AL Publ… 0.968 1069 20621 13689 70965 0.482
# ℹ 4 more variables: comp_rate <dbl>, first_gen <dbl>, debt <dbl>,
# locale <chr>
The head()
function returns “A tibble: 6 x 14” and then the first six rows of the scorecard
data.
Tibble vs. data frame
A tibble is an opinionated version of the R
data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:
When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.
Let’s look at the differences in the output when we type
scorecard
(tibble) in the console versus typingcars
(data frame) in the console.Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get a warning message if you try to access a variable that doesn’t exist in a tibble. You will get
NULL
if you try to access a variable that doesn’t exist in a data frame.
$apple scorecard
Warning: Unknown or uninitialised column: `apple`.
NULL
$apple cars
NULL
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling.
Quick summary of key dplyr functions3
Rows:
filter()
:chooses rows based on column values.slice()
: chooses rows based on location.arrange()
: changes the order of the rowssample_n()
: take a random subset of the rows
Columns:
select()
: changes whether or not a column is included.rename()
: changes the name of columns.mutate()
: changes the values of columns and creates new columns.
Groups of rows:
summarize()
: collapses a group into a single row.count()
: count unique values of one or more variables.group_by()
: perform calculations separately for each value of a variable
Operators
In order to make comparisons, we will use logical operators. These should be familiar from other programming languages. See below for a reference table for how to use these operators in R.
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x
is logical (TRUE / FALSE).
The pipe
Before working with data wrangling functions, let’s formally introduce the pipe. The pipe, |>
, is an operator (a tool) for passing information from one process to another. We will use |>
mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
- Your turn (3 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
|>
scorecard select(name, type) |>
head()
# A tibble: 6 × 2
name type
<chr> <chr>
1 Alabama A & M University Public
2 University of Alabama at Birmingham Public
3 University of Alabama in Huntsville Public
4 Alabama State University Public
5 The University of Alabama Public
6 Auburn University at Montgomery Public
# add code here
Exercises
Single function transformations
Demo: Select the name
column.
# add code here
Demo: Select all columns except unit_id
.
# add code here
Demo: Filter the data frame to keep only schools with a greater than 40% share of first-generation students.
# add code here
Your turn: Filter the data frame to keep only public schools with a net cost of attendance below $12,000.
# add code here
Multiple function transformations
Your turn: How many public colleges and universities in each state have a net cost of attendance below $12,000?
# add code here
Your turn: Generate a data frame with the 10 most expensive colleges in 2021-22 based on net cost of attendance.
# add code here
Your turn: Generate a data frame with the average SAT score for each type of college.
Note that since the sat_avg
column contains NA
s (missing values), we need to explicitly exclude them from our mean calculation. Otherwise the resulting data frame contains NA
s.
# add code here
Your turn: Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name, net cost of attendance, average faculty salary, and the calculated value. How many Cornell and Ithaca College students does it take to pay their average faculty member’s salary?
You should use the net cost of attendance measure, not the sticker price.
# add code here
Your turn: Calculate how many private, nonprofit schools have a smaller net cost than Cornell University.
You will need to create a new column that ranks the schools by net cost of attendance. Look at the back of the dplyr cheatsheet for functions that can be used to calculate rankings.
# add code here
Footnotes
College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.↩︎
The full database contains thousands of variables from 1996-2022.↩︎
From dplyr vignette↩︎