Scraping data from the web

Lecture 11

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

February 29, 2024

Announcements

  • Homework 03
  • Work on project proposals

Web scraping

Scraping the web: what? why?

  • An increasing amount of data is available on the web

  • These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors

  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

  • Two different scenarios:

    • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

    • Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files

HyperText Markup Language

  • Most of the data on the web is still available as HTML
  • It is structured (hierarchical / tree based), but it’s often not available in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>

rvest

  • The rvest package makes basic processing and manipulation of HTML data straightforward
  • It’s designed to work with pipelines built with |>
  • rvest.tidyverse.org

rvest hex logo

Core rvest functions

  • read_html() - Read HTML data from a URL or character string (actually from the xml2 package, but most often used along with other rvest functions)
  • html_element() / html_elements() - Select specified element(s) from an HTML document
  • html_table() - Parse an HTML table into a data frame
  • html_text() - Extract text from an element
  • html_text2() - Extract text from an element and lightly format it to match how text looks in the browser
  • html_name() - Extract elements’ names
  • html_attr() / html_attrs() - Extract a single attribute or all attributes
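A minimal sketch of these functions, applied to the toy HTML snippet from earlier (read_html() accepts a literal HTML string as well as a URL):
library(rvest)

# parse HTML from a character string instead of a URL
page <- read_html(x = '<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# extract elements and inspect them
html_element(x = page, css = "title") |> html_text2()
#> [1] "This is a title"
html_element(x = page, css = "p") |> html_name()
#> [1] "p"
html_element(x = page, css = "p") |> html_attr(name = "align")
#> [1] "center"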

Application exercise

The Cornell Review

A screenshot of the homepage for the Cornell Review

Goal

  • Scrape data and organize it in a tidy format in R
  • Perform light text parsing to clean data
  • Summarize and visualize the data (next week)

ae-09

  • Go to the course GitHub org and find your ae-09 (repo name will be suffixed with your GitHub name).
  • Clone the repo in RStudio Workbench, open the R script in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

  • Use the SelectorGadget to identify elements you want to grab
  • Use rvest to first read the whole page (into R) and then parse the object you’ve read in, extracting the elements you’re interested in
  • Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
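For example, a minimal sketch of this workflow for the Cornell Review homepage (the URL and CSS selectors below are hypothetical placeholders; you would identify the real selectors with SelectorGadget):
library(tidyverse)
library(rvest)

# read the whole page into R once
review_page <- read_html(x = "https://www.cornellreview.org/")

# parse out the elements of interest and collect them in a tibble
articles <- tibble(
  title = html_elements(x = review_page, css = ".article-title") |> html_text2(),
  date = html_elements(x = review_page, css = ".article-date") |> html_text2()
)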

A new R workflow

  • When working in a Quarto document, your analysis is re-run each time you render

  • If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice to the website’s server)!

  • An alternative workflow:

    • Use an R script to save your code
    • Save the data scraped by that script as interim CSV or RDS files
    • Use the saved data in your analysis in your Quarto document (see the sketch below)
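A minimal sketch of that split, assuming a hypothetical scraping script (scrape.R) and a data/ folder:
# in scrape.R: scrape once, then save the result to disk
library(tidyverse)
library(rvest)

scraped_data <- tibble(
  text = read_html(x = "https://example.com/") |>
    html_elements(css = "p") |>
    html_text2()
)
write_rds(x = scraped_data, file = "data/scraped-data.rds")

# in the Quarto document: load the saved copy instead of re-scraping
scraped_data <- read_rds(file = "data/scraped-data.rds")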

Web scraping considerations

Ethics: “Can you?” vs “Should you?”

Challenges: Unreliable formatting

Challenges: Data broken into many pages

Workflow: Screen scraping vs. APIs

Two different scenarios for web scraping:

  • Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)

  • Web APIs (application programming interface): the website offers a set of structured HTTP requests that return JSON or XML files

    • We’ll learn this approach in a couple of weeks
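As a preview, a minimal sketch of the API approach, assuming a hypothetical endpoint that returns JSON (the URL is a placeholder, not a real API):
library(jsonlite)

# a structured request: the server returns JSON, which fromJSON() parses
# into R objects (often a data frame), so no HTML parsing or CSS selectors
# are needed (the URL here is a placeholder, not a real endpoint)
episodes <- fromJSON(txt = "https://example.com/api/episodes.json")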

Code
library(tidyverse)
library(rvest)
library(tvthemes)

# get episode ratings for season 1
ratings_page <- read_html(x = "https://www.imdb.com/title/tt9018736/episodes/?ref_=tt_eps_sm")

# extract elements
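# (selectors like these are typically found with SelectorGadget or the browser
# inspector; they can break whenever IMDB changes its page layout)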
ratings_raw <- tibble(
  episode = html_elements(x = ratings_page, css = ".bblZrR .ipc-title__text") |>
    html_text2(),
  rating = html_elements(x = ratings_page, css = ".ratingGroup--imdb-rating") |>
    html_text2()
)

# clean data
ratings <- ratings_raw |>
  # separate episode number and title
  separate_wider_delim(
    cols = episode,
    delim = " ∙ ",
    names = c("episode_number", "episode_title")
  ) |>
  separate_wider_delim(
    cols = episode_number,
    delim = ".",
    names = c("season", "episode_number")
  ) |>
  # separate rating and number of votes
  separate_wider_delim(
    cols = rating,
    delim = " ",
    names = c("rating", "votes")
  ) |>
  # convert numeric variables
  mutate(
    across(
      .cols = -episode_title,
      .fns = parse_number
    ),
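    # vote counts on the page appear in thousands (e.g. "3.4K"), so rescale to raw counts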
    votes = votes * 1e03
  )

# draw the plot
ratings |>
  # generate x-axis tick mark labels with title and episode number
  mutate(
    episode_title = str_glue("{episode_title}\n(S{season}E{episode_number})"),
    episode_title = fct_reorder(.f = episode_title, .x = episode_number)
  ) |>
  # draw a lollipop chart
  ggplot(mapping = aes(x = episode_title, y = rating)) +
  geom_point(mapping = aes(size = votes)) +
  geom_segment(
    mapping = aes(
      x = episode_title, xend = episode_title,
      y = 0, yend = rating
    )
  ) +
  # adjust the size scale
  scale_size(range = c(3, 8)) +
  # label the chart
  labs(
    title = "Live-action Avatar The Last Airbender is decent",
    x = NULL,
    y = "IMDB rating",
    caption = "Source: IMDB"
  ) +
  # use an Avatar theme
  theme_avatar(
    # custom font
    title.font = "Slayer",
    text.font = "Slayer",
    legend.font = "Slayer",
    # shrink legend text size
    legend.title.size = 8,
    legend.text.size = 6
  ) +
  theme(
    # remove undesired grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    # move legend to the top
    legend.position = "top",
    # align title flush with the edge
    plot.title.position = "plot",
    # shrink x-axis text labels to fit
    axis.text.x = element_text(size = rel(x = 0.7))
  )

It was decent