AE 03: Wrangling college education metrics

Suggested answers

Application exercise

February 6, 2024


These are suggested answers. This document should be used as a reference only; it is not designed to be an exhaustive key.

To demonstrate data wrangling we will use data from College Scorecard.1 The subset we will analyze contains a small number of metrics for all four-year colleges and universities in the United States for the 2021-22 academic year.2


The data is stored in scorecard.csv. The variables are:

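# load the tidyverse, which provides read_csv() and the dplyr functions used below
library(tidyverse)

scorecard <- read_csv("data/scorecard.csv")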

The data frame has 1,719 observations (rows), so we will not view the entire data frame. Instead, we’ll use the commands below to explore the data.

glimpse(scorecard)
Rows: 1,719
Columns: 14
$ unit_id     <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 10…
$ name        <chr> "Alabama A & M University", "University of Alabama at Birm…
$ state       <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ type        <chr> "Public", "Public", "Public", "Public", "Public", "Public"…
$ adm_rate    <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.…
$ sat_avg     <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111…
$ cost        <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 36…
$ net_cost    <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 20…
$ avg_fac_sal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 5…
$ pct_pell    <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.…
$ comp_rate   <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.…
$ first_gen   <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.3…
$ debt        <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 15…
$ locale      <chr> "City", "City", "City", "City", "City", "City", "City", "C…
names(scorecard)
 [1] "unit_id"     "name"        "state"       "type"        "adm_rate"   
 [6] "sat_avg"     "cost"        "net_cost"    "avg_fac_sal" "pct_pell"   
[11] "comp_rate"   "first_gen"   "debt"        "locale"     
head(scorecard)
# A tibble: 6 × 14
  unit_id name  state type  adm_rate sat_avg  cost net_cost avg_fac_sal pct_pell
    <dbl> <chr> <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>    <dbl>
1  100654 Alab… AL    Publ…    0.716     954 21924    13057       79011    0.685
2  100663 Univ… AL    Publ…    0.885    1266 26248    16585      104310    0.325
3  100706 Univ… AL    Publ…    0.737    1300 24869    17250       88380    0.238
4  100724 Alab… AL    Publ…    0.980     955 21938    13593       69309    0.720
5  100751 The … AL    Publ…    0.789    1244 31050    21534       94581    0.171
6  100830 Aubu… AL    Publ…    0.968    1069 20621    13689       70965    0.482
# ℹ 4 more variables: comp_rate <dbl>, first_gen <dbl>, debt <dbl>,
#   locale <chr>

The head() function returns “A tibble: 6 x 14” and then the first six rows of the scorecard data.

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are two main differences between a tibble and a data frame:

  1. When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

    Let’s look at the differences in the output when we type scorecard (tibble) in the console versus typing cars (data frame) in the console.

  2. Tibbles are somewhat stricter than data frames when it comes to subsetting data. If you try to access a variable that doesn’t exist in a tibble, you will get a warning message. If you try the same thing with a data frame, you will silently get NULL.

Warning: Unknown or uninitialised column: `apple`.
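
The subsetting difference described above can be sketched on toy data. This is a minimal illustration, not the course's own code: the objects df and tb and their column x are hypothetical stand-ins, and tryCatch() is used only to capture the warning text so it can be inspected.

```r
library(tibble)

# Hypothetical toy data (not the scorecard data)
df <- data.frame(x = 1:3)
tb <- tibble(x = 1:3)

# Base data frame: asking for a missing column silently returns NULL
df_result <- df$apple

# Tibble: the same request raises a warning;
# tryCatch() captures the warning message as a string
tb_warning <- tryCatch(
  tb$apple,
  warning = function(w) conditionMessage(w)
)

df_result
tb_warning
```

The warning is useful in practice because a misspelled column name is caught immediately instead of propagating NULL through later calculations.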

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling.

Helpful data wrangling resources

Quick summary of key dplyr functions3


Rows:

  • filter(): chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows.
  • sample_n(): takes a random subset of the rows.


Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarize(): collapses a group into a single row.
  • count(): counts unique values of one or more variables.
  • group_by(): performs calculations separately for each value of a variable.
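
As a rough sketch of how several of these verbs fit together, the example below chains row, column, and group operations on a small made-up tibble. The data frame schools and its values are hypothetical and stand in for scorecard.

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
schools <- tibble(
  name = c("A", "B", "C", "D"),
  type = c("Public", "Public", "Private", "Private"),
  cost = c(10000, 14000, 30000, 26000)
)

result <- schools |>
  filter(cost > 12000) |>                # rows: keep costs above 12,000
  mutate(cost_k = cost / 1000) |>        # columns: add cost in thousands
  select(name, type, cost_k) |>          # columns: keep three columns
  group_by(type) |>                      # groups: one group per type
  summarize(mean_cost_k = mean(cost_k))  # groups: collapse to one row per type

result
```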


In order to make comparisons, we will use logical operators. These should be familiar from other programming languages. See below for a reference table for how to use these operators in R.

operator      definition
<             is less than?
<=            is less than or equal to?
>             is greater than?
>=            is greater than or equal to?
==            is exactly equal to?
!=            is not equal to?
x & y         is x AND y?
x | y         is x OR y?
is.na(x)      is x NA?
!is.na(x)     is x not NA?
x %in% y      is x in y?
!(x %in% y)   is x not in y?
!x            is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).
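
A few of the operators in the table can be tried on a short vector. The vector x below is a made-up example, chosen to include a missing value so the behavior of NA is visible.

```r
# Hypothetical example vector with one missing value
x <- c(5, NA, 12)

gt_ten  <- x > 10              # comparisons involving NA return NA
missing <- is.na(x)            # TRUE only where x is NA
present <- !is.na(x)           # the negation of is.na()
in_set  <- x %in% c(5, 12)     # %in% returns FALSE, not NA, for a missing value
between <- (x > 3) & (x < 10)  # AND of two comparisons

gt_ten
missing
present
in_set
between
```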

The pipe

Before working with data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.

  • Your turn (3 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
scorecard |>
  select(name, type) |>
  head()
# A tibble: 6 × 2
  name                                type  
  <chr>                               <chr> 
1 Alabama A & M University            Public
2 University of Alabama at Birmingham Public
3 University of Alabama in Huntsville Public
4 Alabama State University            Public
5 The University of Alabama           Public
6 Auburn University at Montgomery     Public
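
One possible alternative is to nest the function calls instead of piping them. The sketch below shows the equivalence on a small made-up tibble; toy and its columns are hypothetical, and the same pattern would apply to scorecard.

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
toy <- tibble(
  name = c("A", "B", "C"),
  type = c("Public", "Private", "Public"),
  cost = c(1, 2, 3)
)

# Piped: read as "take toy, and then select, and then head"
piped <- toy |>
  select(name, type) |>
  head(2)

# Nested: the same steps, read from the inside out
nested <- head(select(toy, name, type), 2)

identical(piped, nested)
```

The nested form gives the same result but becomes harder to read as more steps are added, which is the main argument for the pipe.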


Single function transformations

Demo: Select the name column.

select(.data = scorecard, name)
# A tibble: 1,719 × 1
   name                               
   <chr>                              
 1 Alabama A & M University           
 2 University of Alabama at Birmingham
 3 University of Alabama in Huntsville
 4 Alabama State University           
 5 The University of Alabama          
 6 Auburn University at Montgomery    
 7 Auburn University                  
 8 Birmingham-Southern College        
 9 Faulkner University                
10 Herzing University-Birmingham      
# ℹ 1,709 more rows

Demo: Select all columns except unit_id.

select(.data = scorecard, -unit_id)
# A tibble: 1,719 × 13
   name         state type  adm_rate sat_avg  cost net_cost avg_fac_sal pct_pell
   <chr>        <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>    <dbl>
 1 Alabama A &… AL    Publ…    0.716     954 21924    13057       79011    0.685
 2 University … AL    Publ…    0.885    1266 26248    16585      104310    0.325
 3 University … AL    Publ…    0.737    1300 24869    17250       88380    0.238
 4 Alabama Sta… AL    Publ…    0.980     955 21938    13593       69309    0.720
 5 The Univers… AL    Publ…    0.789    1244 31050    21534       94581    0.171
 6 Auburn Univ… AL    Publ…    0.968    1069 20621    13689       70965    0.482
 7 Auburn Univ… AL    Publ…    0.712      NA 32678    23258       99837    0.130
 8 Birmingham-… AL    Priv…    0.659    1214 33920    21098       68724    0.215
 9 Faulkner Un… AL    Priv…    0.646    1042 36457    20371       56439    0.461
10 Herzing Uni… AL    Priv…    0.938      NA 31202    26887       59796    0.675
# ℹ 1,709 more rows
# ℹ 4 more variables: comp_rate <dbl>, first_gen <dbl>, debt <dbl>,
#   locale <chr>

Demo: Filter the data frame to keep only schools with a greater than 40% share of first-generation students.

filter(.data = scorecard, first_gen > .40)
# A tibble: 355 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  101189 Faulkner Uni… AL    Priv…    0.646    1042 36457    20371       56439
 2  101365 Herzing Univ… AL    Priv…    0.938      NA 31202    26887       59796
 3  101587 University o… AL    Publ…    0.743    1017 22525    16258       60570
 4  102270 Stillman Col… AL    Priv…    0.758      NA 24540    17127       45990
 5  104717 Grand Canyon… AZ    Priv…    0.828      NA 31393    21176       62604
 6  106467 Arkansas Tec… AR    Publ…    0.944      NA 19857    12191       61605
 7  107983 Southern Ark… AR    Publ…    0.626    1094 23732    15873       63387
 8  110361 California B… CA    Priv…    0.637      NA 46274    21812       91359
 9  110486 California S… CA    Publ…    0.852      NA 18172     6821       85212
10  110495 California S… CA    Publ…    0.948      NA 17005     6041       86751
# ℹ 345 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

Your turn: Filter the data frame to keep only public schools with a net cost of attendance below $12,000.

# conditions separated by commas are combined with AND
filter(.data = scorecard, type == "Public", net_cost < 12000)
# A tibble: 156 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  102614 University o… AK    Publ…    0.647    1169 19621     9684       86130
 2  102632 University o… AK    Publ…    0.562      NA 16275     7368       75924
 3  106412 University o… AR    Publ…    0.675     902 18873    10542       54072
 4  110486 California S… CA    Publ…    0.852      NA 18172     6821       85212
 5  110495 California S… CA    Publ…    0.948      NA 17005     6041       86751
 6  110510 California S… CA    Publ…    0.910      NA 13365     2232       89757
 7  110529 California S… CA    Publ…    0.606      NA 22108    11740       95004
 8  110547 California S… CA    Publ…    0.896      NA 14579     3724       88389
 9  110556 California S… CA    Publ…    0.973      NA 17014     5640       85914
10  110565 California S… CA    Publ…    0.594      NA 14182     4211       93177
# ℹ 146 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>
# the same filter, with the two conditions joined explicitly by &
filter(.data = scorecard, type == "Public" & net_cost < 12000)
# A tibble: 156 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  102614 University o… AK    Publ…    0.647    1169 19621     9684       86130
 2  102632 University o… AK    Publ…    0.562      NA 16275     7368       75924
 3  106412 University o… AR    Publ…    0.675     902 18873    10542       54072
 4  110486 California S… CA    Publ…    0.852      NA 18172     6821       85212
 5  110495 California S… CA    Publ…    0.948      NA 17005     6041       86751
 6  110510 California S… CA    Publ…    0.910      NA 13365     2232       89757
 7  110529 California S… CA    Publ…    0.606      NA 22108    11740       95004
 8  110547 California S… CA    Publ…    0.896      NA 14579     3724       88389
 9  110556 California S… CA    Publ…    0.973      NA 17014     5640       85914
10  110565 California S… CA    Publ…    0.594      NA 14182     4211       93177
# ℹ 146 more rows
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>
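
The two calls above return the same rows because filter() combines comma-separated conditions with AND. A minimal sketch on made-up data (toy is hypothetical) shows this, with | added for contrast:

```r
library(dplyr)

# Hypothetical toy data standing in for scorecard
toy <- tibble(
  type     = c("Public", "Public", "Private"),
  net_cost = c(9000, 15000, 8000)
)

with_comma <- filter(toy, type == "Public", net_cost < 12000)
with_and   <- filter(toy, type == "Public" & net_cost < 12000)
with_or    <- filter(toy, type == "Public" | net_cost < 12000)

identical(with_comma, with_and)  # both keep only the first row
nrow(with_or)                    # OR keeps any row meeting either condition
```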

Multiple function transformations

Your turn: How many public colleges and universities in each state have a net cost of attendance below $12,000?

# using group_by() and summarize()
scorecard |>
  filter(type == "Public", net_cost < 12000) |>
  group_by(state) |>
  summarize(n = n())
# A tibble: 35 × 2
   state     n
   <chr> <int>
 1 AK        2
 2 AR        1
 3 CA       18
 4 CT        4
 5 DE        1
 6 FL       12
 7 FM        1
 8 GA        6
 9 IL        4
10 IN        9
# ℹ 25 more rows
# using count()
scorecard |>
  filter(type == "Public", net_cost < 12000) |>
  count(state)
# A tibble: 35 × 2
   state     n
   <chr> <int>
 1 AK        2
 2 AR        1
 3 CA       18
 4 CT        4
 5 DE        1
 6 FL       12
 7 FM        1
 8 GA        6
 9 IL        4
10 IN        9
# ℹ 25 more rows
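
The two approaches give the same counts because count(state) behaves like group_by(state) followed by summarize(n = n()). A small check on made-up data (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: one row per school
toy <- tibble(state = c("NY", "NY", "CA"))

by_summarize <- toy |>
  group_by(state) |>
  summarize(n = n())

by_count <- toy |>
  count(state)

by_summarize
by_count
```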

Your turn: Generate a data frame with the 10 most expensive colleges in 2021-22 based on net cost of attendance.

We could use a combination of arrange() and slice() to sort the data frame from most to least expensive, then keep the first 10 rows:

# using desc()
arrange(.data = scorecard, desc(net_cost)) |>
  slice(1:10)
# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  247649 Landmark Col… VT    Priv…    0.529      NA 79200    50759       59148
 2  136774 Ringling Col… FL    Priv…    0.687      NA 68123    50747       80217
 3  197151 School of Vi… NY    Priv…    0.713    1251 67606    50183       35478
 4  119775 Newschool of… CA    Priv…    0.608      NA 55686    48752       60804
 5  192712 Manhattan Sc… NY    Priv…    0.490      NA 71175    47969       70830
 6  449384 Gnomon        CA    Priv…    0.448      NA 53847    47099          NA
 7  214148 Moore Colleg… PA    Priv…    0.581      NA 64305    46750       65493
 8  197708 Yeshiva Univ… NY    Priv…    0.627      NA 65898    46632      115812
 9  109651 Art Center C… CA    Priv…    0.756      NA 68087    46201       71478
10  111081 California I… CA    Priv…    0.290      NA 74931    46083       85995
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>
# using -
arrange(.data = scorecard, -net_cost) |>
  slice(1:10)
# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  247649 Landmark Col… VT    Priv…    0.529      NA 79200    50759       59148
 2  136774 Ringling Col… FL    Priv…    0.687      NA 68123    50747       80217
 3  197151 School of Vi… NY    Priv…    0.713    1251 67606    50183       35478
 4  119775 Newschool of… CA    Priv…    0.608      NA 55686    48752       60804
 5  192712 Manhattan Sc… NY    Priv…    0.490      NA 71175    47969       70830
 6  449384 Gnomon        CA    Priv…    0.448      NA 53847    47099          NA
 7  214148 Moore Colleg… PA    Priv…    0.581      NA 64305    46750       65493
 8  197708 Yeshiva Univ… NY    Priv…    0.627      NA 65898    46632      115812
 9  109651 Art Center C… CA    Priv…    0.756      NA 68087    46201       71478
10  111081 California I… CA    Priv…    0.290      NA 74931    46083       85995
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>

We can also use the slice_max() function in dplyr to accomplish the same thing with a single function.

slice_max(.data = scorecard, order_by = net_cost, n = 10)
# A tibble: 10 × 14
   unit_id name          state type  adm_rate sat_avg  cost net_cost avg_fac_sal
     <dbl> <chr>         <chr> <chr>    <dbl>   <dbl> <dbl>    <dbl>       <dbl>
 1  247649 Landmark Col… VT    Priv…    0.529      NA 79200    50759       59148
 2  136774 Ringling Col… FL    Priv…    0.687      NA 68123    50747       80217
 3  197151 School of Vi… NY    Priv…    0.713    1251 67606    50183       35478
 4  119775 Newschool of… CA    Priv…    0.608      NA 55686    48752       60804
 5  192712 Manhattan Sc… NY    Priv…    0.490      NA 71175    47969       70830
 6  449384 Gnomon        CA    Priv…    0.448      NA 53847    47099          NA
 7  214148 Moore Colleg… PA    Priv…    0.581      NA 64305    46750       65493
 8  197708 Yeshiva Univ… NY    Priv…    0.627      NA 65898    46632      115812
 9  109651 Art Center C… CA    Priv…    0.756      NA 68087    46201       71478
10  111081 California I… CA    Priv…    0.290      NA 74931    46083       85995
# ℹ 5 more variables: pct_pell <dbl>, comp_rate <dbl>, first_gen <dbl>,
#   debt <dbl>, locale <chr>
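
The approaches agree here, but they can differ when values are tied at the cutoff: slice_max() keeps ties by default, so it may return more than n rows. The sketch below uses made-up data (toy is hypothetical) to show the difference and the with_ties argument that controls it.

```r
library(dplyr)

# Hypothetical toy data with a tie for second place
toy <- tibble(
  name     = c("A", "B", "C", "D"),
  net_cost = c(40, 50, 40, 10)
)

# arrange() + slice() always returns exactly two rows
top2_sorted <- toy |>
  arrange(desc(net_cost)) |>
  slice(1:2)

# slice_max() keeps both schools tied at 40, so three rows come back
top2_max <- toy |>
  slice_max(order_by = net_cost, n = 2)

# with_ties = FALSE restores the two-row behavior
top2_no_ties <- toy |>
  slice_max(order_by = net_cost, n = 2, with_ties = FALSE)

nrow(top2_sorted)
nrow(top2_max)
nrow(top2_no_ties)
```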

Your turn: Generate a data frame with the average SAT score for each type of college.

Note that since the sat_avg column contains NAs (missing values), we need to explicitly exclude them from our mean calculation. Otherwise the resulting data frame contains NAs.

# incorrect - does not handle NAs, so every group mean is NA
scorecard |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg))
# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit       NA
2 Private, nonprofit        NA
3 Public                    NA
# exclude NAs using the na.rm argument of mean()
scorecard |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg, na.rm = TRUE))
# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit    1270.
2 Private, nonprofit     1182.
3 Public                 1136.
# exclude NAs using drop_na() to remove the rows prior to summarizing
scorecard |>
  drop_na(sat_avg) |>
  group_by(type) |>
  summarize(mean_sat = mean(sat_avg))
# A tibble: 3 × 2
  type                mean_sat
  <chr>                  <dbl>
1 Private, for-profit    1270.
2 Private, nonprofit     1182.
3 Public                 1136.
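
The effect of na.rm can be seen on a single short vector. The values in sat below are made up for illustration.

```r
# Hypothetical SAT averages with one missing value
sat <- c(1000, NA, 1200)

mean_with_na    <- mean(sat)                # a single NA makes the result NA
mean_without_na <- mean(sat, na.rm = TRUE)  # NA is dropped before averaging

mean_with_na
mean_without_na
```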

Your turn: Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name, net cost of attendance, average faculty salary, and the calculated value. How many Cornell and Ithaca College students does it take to pay their average faculty member’s salary?


You should use the net cost of attendance measure, not the sticker price.

scorecard |>
  # mutate() to create a column with the ratio
  mutate(ratio = avg_fac_sal / net_cost) |>
  # select() to keep only the name, net cost, salary, and ratio columns
  select(name, net_cost, avg_fac_sal, ratio) |>
  # filter() to keep only Cornell and Ithaca College
  filter(name == "Cornell University" | name == "Ithaca College")
# A tibble: 2 × 4
  name               net_cost avg_fac_sal ratio
  <chr>                 <dbl>       <dbl> <dbl>
1 Cornell University    29011      141849  4.89
2 Ithaca College        33748       81369  2.41
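
The ratios can be checked by hand from the net_cost and avg_fac_sal values printed above. Rounding up to whole students with ceiling() is an interpretation added here, not part of the original answer.

```r
# Values copied from the output above
cornell_ratio <- 141849 / 29011
ithaca_ratio  <- 81369 / 33748

# Ratios rounded to two decimals, matching the ratio column
round(c(cornell_ratio, ithaca_ratio), 2)

# Whole students needed to fully cover one average salary
ceiling(c(cornell_ratio, ithaca_ratio))
```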

Your turn: Calculate how many private, nonprofit schools have a smaller net cost than Cornell University.

You will need to create a new column that ranks the schools by net cost of attendance. Look at the back of the dplyr cheatsheet for functions that can be used to calculate rankings.

Reported as the total number of schools:

scorecard |>
  # keep only private, nonprofit schools and sort by net cost in increasing order
  filter(type == "Private, nonprofit") |>
  arrange(net_cost) |>
  # use row_number() to rank each school by net cost but subtract 1
  # since Cornell is not cheaper than itself
  mutate(net_cost_rank = row_number() - 1) |>
  # examine output for Cornell
  filter(name == "Cornell University") |>
  select(name, net_cost, net_cost_rank)
# A tibble: 1 × 3
  name               net_cost net_cost_rank
  <chr>                 <dbl>         <dbl>
1 Cornell University    29011           869

Reported as the percentage of schools:

scorecard |>
  # keep only private, nonprofit schools
  filter(type == "Private, nonprofit") |>
  # use percent_rank() to rank each school by net cost in percentiles
  mutate(net_cost_rank = percent_rank(net_cost)) |>
  # examine output for Cornell
  filter(name == "Cornell University") |>
  select(name, net_cost, net_cost_rank)
# A tibble: 1 × 3
  name               net_cost net_cost_rank
  <chr>                 <dbl>         <dbl>
1 Cornell University    29011         0.811
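
The two ranking approaches are closely related. As a minimal sketch on a made-up vector with no ties and no missing values (net_cost below is hypothetical), subtracting 1 from the rank gives the number of cheaper schools, and percent_rank() rescales that count to the range 0 to 1.

```r
library(dplyr)

# Hypothetical net costs, with no ties and no missing values
net_cost <- c(30, 10, 20, 50, 40)

ranks     <- row_number(net_cost)   # 1 = cheapest
n_cheaper <- ranks - 1              # schools with a smaller net cost
pct       <- percent_rank(net_cost) # (min_rank - 1) / (n - 1)

ranks
n_cheaper
pct

# percent_rank() written out by hand
by_hand <- (min_rank(net_cost) - 1) / (length(net_cost) - 1)
all.equal(pct, by_hand)
```

With ties or missing values in the column, the two measures would need more care, since row_number() breaks ties by position while percent_rank() gives tied values the same rank.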

  1. College Scorecard is a product of the U.S. Department of Education and compiles detailed information about student completion, debt and repayment, earnings, and more for all degree-granting institutions across the country.

  2. The full database contains thousands of variables from 1996-2022.

  3. From the dplyr vignette.