Scraping data from the web

Lecture 11

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

February 29, 2024

Announcements

  • Homework 03
  • Work on project proposals

Web scraping

Scraping the web: what? why?

  • An increasing amount of data is available on the web

  • These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors

  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

  • Two different scenarios:

    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

    • Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files

HyperText Markup Language

  • Most of the data on the web is still available as HTML
  • It is structured (hierarchical / tree based), but it’s often not available in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

rvest

  • The rvest package makes basic processing and manipulation of HTML data straightforward
  • It’s designed to work with pipelines built with |>
  • rvest.tidyverse.org

rvest hex logo

Core rvest functions

  • read_html() - Read HTML data from a URL or character string (actually from the xml2 package, but most often used along with other rvest functions)
  • html_element() / html_elements() - Select specified element(s) from an HTML document
  • html_table() - Parse an HTML table into a data frame
  • html_text() - Extract text from an element
  • html_text2() - Extract text from an element and lightly format it to match how text looks in the browser
  • html_name() - Extract elements’ names
  • html_attr() / html_attrs() - Extract a single attribute or all attributes
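A minimal sketch of these functions, applied to the toy HTML snippet from earlier (read_html() accepts a literal HTML string as well as a URL):
library(rvest)

# parse HTML from a character string instead of a URL
page <- read_html(x = '<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# extract elements and inspect them
html_element(x = page, css = "title") |> html_text2()
#> [1] "This is a title"
html_element(x = page, css = "p") |> html_name()
#> [1] "p"
html_element(x = page, css = "p") |> html_attr(name = "align")
#> [1] "center"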

Application exercise

The Cornell Review

A screenshot of the homepage for the Cornell Review

Goal

  • Scrape data and organize it in a tidy format in R
  • Perform light text parsing to clean data
  • Summarize and visualize the data (next week)

ae-09

  • Go to the course GitHub org and find your ae-09 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the R script in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • Use the SelectorGadget to identify elements you want to grab
  • Use rvest to first read the whole page (into R) and then parse the object you’ve read in, extracting the elements you’re interested in
  • Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
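For example, a minimal sketch of this workflow for the Cornell Review homepage (the URL and CSS selectors below are hypothetical placeholders; you would identify the real selectors with SelectorGadget):
library(tidyverse)
library(rvest)

# read the whole page into R once
review_page <- read_html(x = "https://www.cornellreview.org/")

# parse out the elements of interest and collect them in a tibble
articles <- tibble(
  title = html_elements(x = review_page, css = ".article-title") |> html_text2(),
  date = html_elements(x = review_page, css = ".article-date") |> html_text2()
)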

A new R workflow

  • When working in a Quarto document, your analysis is re-run each time you render

  • If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice to the website’s server)!

  • An alternative workflow:

    • Use an R script to save your code
    • Save the data scraped by that script as interim CSV or RDS files
    • Use the saved data in your analysis in your Quarto document (see the sketch below)
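A minimal sketch of that split, assuming a hypothetical scraping script (scrape.R) and a data/ folder:
# in scrape.R: scrape once, then save the result to disk
library(tidyverse)
library(rvest)

scraped_data <- tibble(
  text = read_html(x = "https://example.com/") |>
    html_elements(css = "p") |>
    html_text2()
)
write_rds(x = scraped_data, file = "data/scraped-data.rds")

# in the Quarto document: load the saved copy instead of re-scraping
scraped_data <- read_rds(file = "data/scraped-data.rds")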

Web scraping considerations

Ethics: “Can you?” vs “Should you?”

Challenges: Unreliable formatting

Challenges: Data broken into many pages

Workflow: Screen scraping vs. APIs

Two different scenarios for web scraping:

  • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

  • Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files

    • We’ll learn this approach in a couple of weeks
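As a preview, a minimal sketch of the API approach, assuming a hypothetical endpoint that returns JSON (the URL is a placeholder, not a real API):
library(jsonlite)

# a structured request: the server returns JSON, which fromJSON() parses
# into R objects (often a data frame), so no HTML parsing or CSS selectors
# are needed (the URL here is a placeholder, not a real endpoint)
episodes <- fromJSON(txt = "https://example.com/api/episodes.json")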

Code
library(tidyverse)
library(rvest)
library(tvthemes)

# get episode ratings for season 1
ratings_page <- read_html(x = "https://www.imdb.com/title/tt9018736/episodes/?ref_=tt_eps_sm")

# extract elements
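# (selectors like these are typically found with SelectorGadget or the browser
# inspector; they can break whenever IMDB changes its page layout)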
ratings_raw <- tibble(
  episode = html_elements(x = ratings_page, css = ".bblZrR .ipc-title__text") |>
    html_text2(),
  rating = html_elements(x = ratings_page, css = ".ratingGroup--imdb-rating") |>
    html_text2()
)

# clean data
ratings <- ratings_raw |>
  # separate episode number and title
  separate_wider_delim(
    cols = episode,
    delim = " ∙ ",
    names = c("episode_number", "episode_title")
  ) |>
  separate_wider_delim(
    cols = episode_number,
    delim = ".",
    names = c("season", "episode_number")
  ) |>
  # separate rating and number of votes
  separate_wider_delim(
    cols = rating,
    delim = " ",
    names = c("rating", "votes")
  ) |>
  # convert numeric variables
  mutate(
    across(
      .cols = -episode_title,
      .fns = parse_number
    ),
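    # vote counts on the page appear in thousands (e.g. "3.4K"), so rescale to raw counts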
    votes = votes * 1e03
  )

# draw the plot
ratings |>
  # generate x-axis tick mark labels with title and episode number
  mutate(
    episode_title = str_glue("{episode_title}\n(S{season}E{episode_number})"),
    episode_title = fct_reorder(.f = episode_title, .x = episode_number)
  ) |>
  # draw a lollipop chart
  ggplot(mapping = aes(x = episode_title, y = rating)) +
  geom_point(mapping = aes(size = votes)) +
  geom_segment(
    mapping = aes(
      x = episode_title, xend = episode_title,
      y = 0, yend = rating
    )
  ) +
  # adjust the size scale
  scale_size(range = c(3, 8)) +
  # label the chart
  labs(
    title = "Live-action Avatar The Last Airbender is decent",
    x = NULL,
    y = "IMDB rating",
    caption = "Source: IMDB"
  ) +
  # use an Avatar theme
  theme_avatar(
    # custom font
    title.font = "Slayer",
    text.font = "Slayer",
    legend.font = "Slayer",
    # shrink legend text size
    legend.title.size = 8,
    legend.text.size = 6
  ) +
  theme(
    # remove undesired grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    # move legend to the top
    legend.position = "top",
    # align title flush with the edge
    plot.title.position = "plot",
    # shrink x-axis text labels to fit
    axis.text.x = element_text(size = rel(x = 0.7))
  )

It was decent