AE 09: Scraping articles from the Cornell Review

Suggested answers

Application exercise
Answers
Modified

March 6, 2024

Packages

We will use the following packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • rvest: For scraping HTML files.
  • lubridate: For formatting date variables.
  • robotstxt: For verifying if we can scrape a website.
library(tidyverse)
library(rvest)
library(robotstxt)

Data scraping

See the code below stored in scrape-cornell-review.R.

# load packages
library(tidyverse)
library(rvest)
library(robotstxt)

# check that we can scrape data from the cornell review
paths_allowed("https://www.thecornellreview.org/")

# read the first page
page <- read_html("https://www.thecornellreview.org/")

# extract desired components
titles <- html_elements(x = page, css = "#main .read-title a") |>
  html_text2()

authors <- html_elements(x = page, css = "#main .byline a") |>
  html_text2()

article_dates <- html_elements(x = page, css = "#main .posts-date") |>
  html_text2()

topics <- html_elements(x = page, css = "#main .cat-links") |>
  html_text2()

abstracts <- html_elements(x = page, css = ".post-description") |>
  html_text2()

post_urls <- html_elements(x = page, css = ".aft-readmore") |>
  html_attr(name = "href")

# create a tibble with this data
review_raw <- tibble(
  title = titles,
  author = authors,
  date = article_dates,
  topic = topics,
  description = abstracts,
  url = post_urls
)

# clean up the data
review <- review_raw |>
  mutate(
    date = mdy(date),
    description = str_remove(string = description, pattern = "\nRead More")
  )

# save to disk
write_csv(x = review, file = "data/cornell-review.csv")
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.5.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-03-03
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2     * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 httr          1.4.7   2023-08-15 [1] CRAN (R 4.3.0)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
 robotstxt   * 0.7.13  2020-09-03 [1] CRAN (R 4.3.0)
 rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 rvest       * 1.0.3   2022-08-19 [1] CRAN (R 4.3.0)
 scales        1.2.1   2024-01-18 [1] Github (r-lib/scales@c8eb772)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.1)
 stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.1)
 xfun          0.41    2023-11-01 [1] CRAN (R 4.3.1)
 xml2          1.3.6   2023-12-04 [1] CRAN (R 4.3.1)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────