HW 06 - Prediction + Bootstrap

Homework
Modified

April 11, 2024

Important

This homework is due April 17 at 11:59pm.

Getting started

  • Go to the info2950-sp24 organization on GitHub. Click on the repo with the prefix hw-06. It contains the starter documents you need to complete the homework.

  • Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.

Workflow + formatting

Make sure to

  • Update author name on your document.
  • Label all code chunks informatively and concisely.
  • Follow the Tidyverse code style guidelines.
  • Make at least 3 commits.
  • Resize figures where needed; avoid tiny or huge plots.
  • Use informative labels for plot axes, titles, etc.
  • Consider aesthetic choices such as color, legend position, etc.
  • Turn in an organized, well formatted document.

Data and packages

The data can be found in the dsbox package, where the dataset is called gss16. Because the dataset is distributed with the package, we don’t need to load it separately; it becomes available when we load the package. We’ll also use the tidyverse and tidymodels packages.

library(tidyverse)
library(tidymodels)
library(dsbox)

You can find out more about the dataset by inspecting its documentation, which you can access by running ?gss16 in the Console or using the Help menu in RStudio to search for gss16. You can also find this information here.

Exercises

The General Social Survey is a biennial survey of the American public.

Exercise 1

On the 2016 edition, respondents were asked to respond to the prompt:

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

Possible answers are Strongly agree, Agree, Don’t know, Disagree and Strongly Disagree.

Prep the data. Create a new data frame called gss16_advfront that includes the variables advfront, educ, polviews, and wrkstat.

  • Remove rows that contain NAs from this new data frame, as well as any respondents who stated “Don’t know” for the advfront question.

  • Re-level the advfront variable such that it has two levels: "Strongly agree" and "Agree" combined into a new level called "Agree", and the remaining levels combined into "Not agree". Ensure the levels are in the order "Agree" and "Not agree".

  • Combine the levels of the polviews variable such that levels that have the word “liberal” in them are lumped into a level called "Liberal" and those that have the word “conservative” in them are lumped into a level called "Conservative". Ensure the levels are in the order "Conservative", "Moderate", and "Liberal".

    Tip

    You can do this in various ways. One option is to use the str_detect() function to detect the existence of words. Note that these sometimes show up with lowercase first letters and sometimes with uppercase first letters. For instance, if you wanted to detect “Apple” or “apple” with the str_detect() function, you could use the pattern [Aa]pple. However, feel free to solve the problem any way you like; this is just one option!
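
The cleaning steps above (dropping NAs, collapsing categories by pattern, and fixing the level order) can be sketched on a small made-up example. The data frame snacks and its columns are hypothetical and have nothing to do with gss16; treat this as one possible pattern under those assumptions, not the required approach.

```r
library(tidyverse)

# Hypothetical toy data, not the GSS variables
snacks <- tibble(
  item = c("Apple pie", "apple juice", "Banana bread", "banana chips", NA)
)

snacks_clean <- snacks |>
  # Remove rows where item is missing
  drop_na(item) |>
  mutate(
    # Collapse items into categories, matching either capitalization
    fruit = case_when(
      str_detect(item, "[Aa]pple") ~ "Apple",
      str_detect(item, "[Bb]anana") ~ "Banana"
    ),
    # Set the level order explicitly instead of relying on alphabetical order
    fruit = factor(fruit, levels = c("Banana", "Apple"))
  )

levels(snacks_clean$fruit)
```

Setting the levels explicitly matters later on, because modeling functions use the level order to decide which category is treated as the reference or the event.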

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 2

Predict scientific advancement. Fit a logistic regression model that predicts advfront by educ. Report the tidy output below and write out the estimated model in appropriate notation.

Using your estimated model, predict the probability that a respondent with 7 years of education agrees with the following statement (Agree in advfront): Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.
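
As a reminder of the arithmetic, a logistic regression with one predictor estimates \(\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\), so a predicted probability is \(p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}\). The sketch below checks this by hand against predict() on simulated data. It uses base R's glm() rather than tidymodels to keep it self-contained, and the data frame toy, its columns x and y, and the simulation coefficients are all made up for illustration.

```r
set.seed(2950)

# Hypothetical simulated data: x plays the role of a numeric predictor
toy <- data.frame(x = rep(0:20, each = 10))
toy$y <- factor(
  ifelse(runif(nrow(toy)) < plogis(-1 + 0.15 * toy$x), "yes", "no"),
  levels = c("no", "yes")
)

# glm() models the probability of the SECOND factor level ("yes" here)
fit <- glm(y ~ x, data = toy, family = binomial)
b <- coef(fit)

# By hand: log-odds at x = 7, then convert to a probability
log_odds <- b[["(Intercept)"]] + b[["x"]] * 7
p_manual <- exp(log_odds) / (1 + exp(log_odds))

# Same prediction from predict()
p_predict <- predict(fit, newdata = data.frame(x = 7), type = "response")

all.equal(p_manual, unname(p_predict))
```

Whichever fitting function you use, check which factor level the model treats as the event before interpreting the coefficients or the predicted probability.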

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 3

  1. Fit a new model that adds the additional explanatory variable of polviews. Report the tidy output below.

  2. Now, predict the probability of agreeing with the following statement: Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government ("Agree" in advfront). Generate predictions for all plausible values of educ and polviews. Visualize the predicted probabilities using a color-coded line chart.

Tip

You should create a data frame that contains all possible combinations of educ and polviews. Do not rely solely on the output from augment() since there is no guarantee that all possible combinations actually were observed in the dataset. I recommend considering tidyr::expand_grid() for this task.
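
A minimal sketch of how expand_grid() builds every combination, using made-up variables (hours and group are hypothetical names, not columns of gss16):

```r
library(tidyr)

# Every combination of a numeric range and a set of categories
new_obs <- expand_grid(
  hours = 0:4,
  group = c("a", "b", "c")
)

nrow(new_obs)  # 5 values of hours times 3 groups
```

A grid like this has one row per combination whether or not that combination appears in the observed data, which is why it is a safer basis for predictions than the fitted values alone.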

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 4

Harassment in the workplace. In 2016, the GSS added a new question on harassment at work. The question is phrased as follows.

Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?

Answers to this question are stored in the harass5 variable in our data set.

  1. Create a subset of the data that only contains Yes and No answers for the harassment question. How many respondents chose each of these answers?

  2. Describe how bootstrapping can be used to estimate the proportion of all Americans who have been harassed by their superiors or co-workers at their job.

  3. Calculate a 95% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Use 1000 iterations when creating your bootstrap distribution. Interpret this interval in the context of the data.

  4. Where was your 95% confidence interval centered? Why does this make sense?

  5. Now, calculate a 90% bootstrap confidence interval for the proportion of Americans who have been harassed by their superiors or co-workers at their job. Report the interval below. Is it wider or narrower than the 95% confidence interval?

  6. Now, suppose you created a bootstrap distribution with 50,000 simulations instead of 1,000. What would you expect to change (if anything) about the 95% confidence interval?

  • Center of the CI
  • Width of the CI
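
To make the mechanics of a percentile bootstrap concrete, here is a sketch in base R on made-up data. The vector responses (30 "Yes" and 70 "No") is hypothetical, and the object names are chosen for illustration. The infer package loaded with tidymodels wraps the same resample-and-summarize steps in its own verbs, so you are not expected to write it this way.

```r
set.seed(2950)  # make the resampling reproducible

# Hypothetical sample: 30 "Yes" and 70 "No" responses
responses <- rep(c("Yes", "No"), times = c(30, 70))
n <- length(responses)

# Resample n responses with replacement, 1,000 times,
# recording the proportion of "Yes" in each resample
boot_props <- replicate(
  1000,
  mean(sample(responses, size = n, replace = TRUE) == "Yes")
)

# Percentile interval: cut off the outer tails of the bootstrap distribution
level <- 0.95
alpha <- 1 - level
boot_ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
boot_ci
```

Changing level or the number of resamples in this sketch is a quick way to check your intuition for the later parts of this exercise.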

Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Exercise 5

Harassment and ideology. Now we wish to answer the following research question:

Does the true proportion of individuals who experience harassment in the workplace differ between liberals and conservatives?

  1. Write out the appropriate null and alternative hypotheses, using both words and notation.
  2. Calculate the observed statistic (i.e. difference in proportions).
  3. Create the null distribution with 1,000 repeated samples. Visualize the null distribution and shade in the area used to calculate the p-value.
  4. Calculate the p-value. Use it to make your conclusion using a significance level of \(0.05\) and write out an appropriate conclusion below. Recall that the conclusion has 3 components.
    • How the p-value compares to the significance level.
    • The decision you make with respect to the hypotheses (reject \(H_0\) or fail to reject \(H_0\)).
    • The conclusion in the context of the alternative hypothesis.
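
The general shape of a permutation test for a difference in proportions can be sketched with the infer package (attached by tidymodels). Everything about the data here is made up: toy, group, and outcome are hypothetical names, and the counts are invented. Adapt the idea rather than the specifics.

```r
library(infer)  # attached automatically by library(tidymodels)

set.seed(2950)

# Hypothetical toy data: two groups of 40 with a yes/no outcome
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 40)),
  outcome = factor(c(
    rep(c("yes", "no"), times = c(12, 28)),  # group a: 12 of 40 say yes
    rep(c("yes", "no"), times = c(8, 32))    # group b: 8 of 40 say yes
  ))
)

# Observed difference in proportions (a minus b)
obs_diff <- toy |>
  specify(outcome ~ group, success = "yes") |>
  calculate(stat = "diff in props", order = c("a", "b"))

# Null distribution: shuffle the outcome against the groups 1,000 times
null_dist <- toy |>
  specify(outcome ~ group, success = "yes") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("a", "b"))

# Histogram of the null distribution with the two-sided tail areas shaded
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

# Proportion of permuted statistics at least as extreme as the observed one
p_val <- get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")
p_val
```

The order argument only determines the sign of the difference; pick one order and use it consistently for both the observed statistic and the null distribution.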

Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before submitting.

Wrap up

Submission

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
  • Click on your INFO 2950 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
  • Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

  • Exercise 1: 5 points
  • Exercise 2: 10 points
  • Exercise 3: 10 points
  • Exercise 4: 10 points
  • Exercise 5: 10 points
  • Workflow + formatting: 5 points
  • Total: 50 points

Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes:

  • Following tidyverse code style
  • All code being visible in the rendered PDF (no line of code longer than 80 characters)
  • Appropriate figure sizing, and figures with informative labels and legends
  • Ensuring reproducibility by setting a random seed value.

Acknowledgments