HW 05 - Regressing socioeconomic data

Homework
Modified

March 24, 2024

Important

This homework is due March 27 at 11:59pm ET.

Getting started

  • Go to the info2950-sp24 organization on GitHub. Click on the repo with the prefix hw-05. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Workflow + formatting

Make sure to

  • Update the author name in your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Use informative labels for plot axes, titles, etc.
  • Consider aesthetic choices such as color, legend position, etc.
  • Turn in an organized, well formatted document.

Packages

We’ll use the tidyverse for much of the data wrangling and visualization and tidymodels for modeling. In addition, we’ll use the palmerpenguins and wbstats packages for data.

library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(wbstats)

Exercises

Exercise 1 - A work of aRt

Your task is to make the following plot as ugly and as ineffective as possible. Change colors, axes, fonts, theme, or anything else you can think of in the code chunk below. You can also search online for other themes, fonts, etc. that you want to tweak. Try to make it as ugly as possible. The sky is the limit!

In 2-3 sentences, explain why the plot you created is ugly (to you, at least) and ineffective.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species)
) +
  geom_point()
Warning: Removed 2 rows containing missing values (`geom_point()`).
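
As a starting point for finding which parts of a plot can be changed, here is a minimal sketch of the layers that control colors, labels, and theme elements. The specific colors, sizes, title, and the name styled_plot are arbitrary choices for illustration, not part of the starter code, and this version is deliberately tidy rather than ugly.

```r
library(tidyverse)
library(palmerpenguins)

# Same starter plot, with each customizable piece in its own layer.
styled_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species)
) +
  # Point appearance: size and shape are set outside aes() because
  # they are constants, not mapped to a variable.
  geom_point(size = 3, shape = 17) +
  # Colors: one value per species level (illustrative choices).
  scale_color_manual(
    values = c(Adelie = "darkorange", Chinstrap = "purple", Gentoo = "cyan4")
  ) +
  # Axis, legend, and title text.
  labs(
    x = "Flipper length (mm)",
    y = "Bill length (mm)",
    color = "Species",
    title = "Bill length vs. flipper length"
  ) +
  # A complete theme first, then individual tweaks on top of it.
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 16),
    panel.grid.minor = element_blank()
  )

styled_plot
```

Each of these layers can be swapped or exaggerated independently, which makes it easier to experiment with one element at a time.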

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2 - Get data from the World Bank

The World Bank publishes extensive socioeconomic data on countries/economies around the world.¹ We will use an excerpt of their data to explore relationships among world health metrics across countries and regions for 2021.

Rather than downloading every country’s complete indicators as standalone spreadsheet files and then importing, tidying, and combining them in a complex iterative operation, we will use the World Bank’s API to obtain the required data in a tidy format. Specifically, use wbstats to obtain GDP per capita (current US$) and Life expectancy at birth, total (years) for 2021. Then combine those indicators with the World Bank’s country metadata² so we know each country’s region. Filter out any rows with missing values. Store the final data frame as gdp_wb.

The resulting data frame should contain 201 observations.
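
The combine-then-filter step can be sketched on toy data. In the sketch below, fetch_indicators() is a hypothetical helper (not part of wbstats) that shows one way the download might look. It is defined but never called, because it needs an internet connection, and its indicator codes and column names are assumptions you should confirm with wb_search(). The two toy tables use made-up values and region names, and the shared key column iso3c is assumed to match what wb_data() and wb_countries() return.

```r
library(tidyverse)

# Hypothetical helper: download both indicators for 2021.
# Defined only for illustration and not called here (requires internet).
# The indicator codes and new column names are assumptions to verify.
fetch_indicators <- function() {
  wbstats::wb_data(
    indicator = c(gdp_per_cap = "NY.GDP.PCAP.CD", life_exp = "SP.DYN.LE00.IN"),
    start_date = 2021,
    end_date = 2021
  )
}

# Toy stand-ins with made-up values (NOT real World Bank data).
indicators_toy <- tribble(
  ~iso3c, ~gdp_per_cap, ~life_exp,
  "AAA",  1000,         60,
  "BBB",  NA,           75,
  "CCC",  20000,        80
)

countries_toy <- tribble(
  ~iso3c, ~region,
  "AAA",  "Region X",
  "BBB",  "Region Y",
  "CCC",  "Region Y"
)

# Attach region to each country, then drop rows with any missing value.
combined_toy <- indicators_toy |>
  left_join(countries_toy, by = "iso3c") |>
  drop_na()

combined_toy
```

In the toy example, the row for "BBB" is dropped because its GDP value is missing, leaving two complete rows.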

Exercise 3 - GDP vs. life expectancy

  1. Visualization: We are interested in learning more about GDP per capita (hereafter referred to simply as GDP), and we’ll start by exploring the relationship between GDP and life expectancy. Create two visualizations:

    • Scatter plot of GDP vs. life expectancy. GDP is the dependent variable/outcome of interest for all of the exercises.

    • Scatter plot of logged GDP vs. life expectancy. Modify the plot’s scale to implement a log transformation for GDP per capita.

    Note

    For the visualizations, you can use scale_*_log10() to transform the axis. For model estimation, use natural logarithms so that the coefficients are interpretable using the method described in class. The default base for log() is \(e\), so you don’t have to change any arguments.

    First describe the relationship between each pair of the variables. Then, comment on which relationship would be better modeled using a linear model, and explain your reasoning.

  2. Model fitting and interpretation:

    • Fit a linear model predicting log GDP from life expectancy. Display the tidy summary.

    • Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).

    • Interpret the slope of the model, making sure that your interpretation is in the units of the original data (not on log scale).

  3. Model evaluation:

    • Calculate the R-squared of the model using two methods and confirm that the values match: the first method uses glance(), and the second is based on the value of the correlation coefficient between the two variables.

    • Interpret R-squared in the context of the data and the research question.
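
The steps in this exercise can be sketched on a different dataset. The example below uses penguins as a stand-in (body mass as the outcome, flipper length as the predictor), so none of its numbers apply to the World Bank data. The object names are hypothetical, and the exp() back-transformation is one common way to read log-scale coefficients, which may differ in wording from the method described in class.

```r
library(tidyverse)
library(tidymodels)
library(palmerpenguins)

# Stand-in data: keep rows where both variables are observed.
penguins_complete <- penguins |>
  drop_na(body_mass_g, flipper_length_mm)

# Plot: scale_y_log10() changes how the axis is drawn, not the data.
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  scale_y_log10()

# Model: log() in the formula is the natural log (base e).
log_fit <- linear_reg() |>
  fit(log(body_mass_g) ~ flipper_length_mm, data = penguins_complete)

# exp() moves estimates back to the outcome's original units:
# exp(intercept) is the predicted outcome when the predictor is 0, and
# exp(slope) is the multiplicative change per one-unit increase.
log_coefs <- tidy(log_fit) |>
  mutate(exp_estimate = exp(estimate)) |>
  select(term, estimate, exp_estimate)
log_coefs

# R-squared two ways: from the model summary and from the correlation
# between the predictor and the logged outcome.
r2_glance <- glance(log_fit)$r.squared
r2_cor <- cor(
  penguins_complete$flipper_length_mm,
  log(penguins_complete$body_mass_g)
)^2

c(glance = r2_glance, correlation = r2_cor)
```

Note that the correlation is computed with the logged outcome, since that is the variable the model actually predicts.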

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 4 - GDP vs. life expectancy + region

Next, we want to examine if the relationship between GDP and life expectancy that we observed in the previous exercise holds across all regions in our data. We’ll continue to work with logged GDP.

  1. Justification: Create a scatter plot of logged GDP vs. life expectancy, with some method of differentiating between each region. Do you think the trend between life expectancy and GDP is different for different regions? Justify your answer with specific features of the plot.

  2. Model fitting and interpretation:

    • Regardless of your answer in part 1, fit an additive model (main effects) that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output.

    • Interpret the intercept of the model, making sure that your interpretation is in the units of the original data (not on log scale).

    • Interpret the slope of the model (i.e., the coefficient for life expectancy), making sure that your interpretation is in the units of the original data (not on log scale).

  3. Prediction: Predict the GDP of a country in East Asia and Pacific where the average life expectancy is 70 years old.
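
Setting a baseline level, fitting a main-effects model, and back-transforming a prediction can be sketched as follows. This again uses penguins as a stand-in, with species playing the role of region and "Gentoo" chosen arbitrarily as the baseline. The object names and the new observation are hypothetical.

```r
library(tidyverse)
library(tidymodels)
library(palmerpenguins)

# Make "Gentoo" the first factor level so it becomes the baseline.
penguins_releveled <- penguins |>
  drop_na(body_mass_g, flipper_length_mm) |>
  mutate(species = fct_relevel(species, "Gentoo"))

# Additive (main effects) model: one shared slope, shifted by species.
additive_fit <- linear_reg() |>
  fit(
    log(body_mass_g) ~ flipper_length_mm + species,
    data = penguins_releveled
  )

tidy(additive_fit)

# A hypothetical new observation, using the same factor levels as the
# training data.
new_penguin <- tibble(
  flipper_length_mm = 200,
  species = factor("Chinstrap", levels = levels(penguins_releveled$species))
)

# predict() returns the logged outcome; exp() converts it back to grams.
predicted <- predict(additive_fit, new_data = new_penguin) |>
  mutate(.pred_original = exp(.pred))

predicted
```

Because the outcome was logged inside the model formula, forgetting the exp() step would report the prediction on the log scale rather than in the original units.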

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5 - GDP vs. life expectancy x region

Finally, we want to re-examine whether the relationship between GDP and life expectancy holds across all regions in our data, this time allowing that relationship to differ across regions. We’ll continue to work with logged GDP.

  1. Model fitting and interpretation: Fit an interaction model that predicts logged GDP from life expectancy and region (with Europe & Central Asia as the baseline level). Display a tidy summary of the model output and in 2-3 sentences, explain how this model differs from the additive model.

  2. Prediction: Predict the GDP of a country in East Asia and Pacific where the average life expectancy is 70 years old. Is this prediction different from your prediction with the additive model from Exercise 4?

  3. Model evaluation: Would the R-squared value for this model in Exercise 5 be larger or smaller than the R-squared value for the model in Exercise 4? Without running any code, write out and justify your answer.
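
The difference in model syntax between main effects and an interaction can be sketched with the penguins stand-in data (species in place of region, "Gentoo" as an arbitrary baseline). Object names and the new observation are hypothetical, and the numbers say nothing about the World Bank data.

```r
library(tidyverse)
library(tidymodels)
library(palmerpenguins)

penguins_releveled <- penguins |>
  drop_na(body_mass_g, flipper_length_mm) |>
  mutate(species = fct_relevel(species, "Gentoo"))

# `+` fits one shared slope; `*` adds slope adjustments per species.
additive_fit <- linear_reg() |>
  fit(
    log(body_mass_g) ~ flipper_length_mm + species,
    data = penguins_releveled
  )

interaction_fit <- linear_reg() |>
  fit(
    log(body_mass_g) ~ flipper_length_mm * species,
    data = penguins_releveled
  )

# The interaction model has extra terms of the form
# flipper_length_mm:species<level> for each non-baseline species.
tidy(interaction_fit)

# Compare predictions for the same hypothetical observation.
new_penguin <- tibble(
  flipper_length_mm = 200,
  species = factor("Chinstrap", levels = levels(penguins_releveled$species))
)

comparison <- bind_rows(
  additive = predict(additive_fit, new_data = new_penguin),
  interaction = predict(interaction_fit, new_data = new_penguin),
  .id = "model"
) |>
  mutate(.pred_original = exp(.pred))

comparison
```

The comparison table has one row per model, with the prediction shown both on the log scale and in the original units.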

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 2950 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

  • Exercise 1: 5 points
  • Exercise 2: 10 points
  • Exercise 3: 10 points
  • Exercise 4: 10 points
  • Exercise 5: 10 points
  • Workflow + formatting: 5 points
  • Total: 50 points

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • Following tidyverse code style
  • All code being visible in the rendered PDF (lines no longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends

Footnotes

  1. You knew this from homework 04.

  2. See wb_countries() for more information.