library(tidyverse)
library(tidymodels)
library(openintro)
Lab 05 - Smoking during pregnancy
This lab is due April 15 at 11:59pm.
Every year, the US releases to the public a large data set containing information on births recorded in the country. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. This is a random sample of 1,000 cases from the data set released in 2014.
Learning goals
- Constructing confidence intervals
- Conducting hypothesis tests
- Interpreting confidence intervals and results of hypothesis tests in context of the data
Getting started
Go to the info2950-sp24 organization on GitHub. Click on the repo with the prefix lab-05. It contains the starter documents you need to complete the lab. You will complete this lab with your team members.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
All team members should clone the team GitHub repository for the lab. Then, one team member should edit the document YAML by adding the team name to the subtitle field and adding the names of the team members contributing to the lab to the author field. Hopefully that’s everyone, but if someone doesn’t contribute during the lab session or throughout the week before the deadline, their name should not be added. If you have 4 members in your team, you can delete the line for the 5th team member. Then, this team member should render the document and commit and push the changes. All others should not touch the document at this stage.
title: "Lab 05 - Smoking during pregnancy"
subtitle: "Team name"
author:
- "Team member 1 (netID)"
- "Team member 2 (netID)"
- "Team member 3 (netID)"
- "Team member 4 (netID)"
- "Team member 5 (netID)"
date: today
format: pdf
editor: source
In the application exercises, we generated the simulation-based values and plots manually using ggplot2 and dplyr operations. The infer package contains helper functions that you can use to automatically visualize the distribution of simulation-based inferential statistics, shade the plots with confidence intervals or p-values, and calculate confidence intervals and p-values directly. Refer to the infer documentation for more information on these functions. You are encouraged to use them for this assignment.
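As a rough illustration of the helper functions mentioned above, the sketch below builds a bootstrap distribution, computes a percentile interval, and shades it on a plot. The data frame `toy`, its column `x`, and the seed value are made up for this example and are not part of the lab data; infer is loaded directly here, although loading tidymodels also attaches it.

```r
library(tidyverse)
library(infer)

set.seed(2950)  # hypothetical seed, used only in this sketch

# Simulated stand-in data (NOT the lab data)
toy <- tibble(x = rnorm(200, mean = 10, sd = 2))

# Bootstrap distribution of the sample mean
boot_dist <- toy |>
  specify(response = x) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Percentile confidence interval, calculated directly
ci <- boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

# Histogram of the bootstrap distribution with the interval shaded
ci_plot <- visualize(boot_dist) +
  shade_confidence_interval(endpoints = ci)
```

Printing `ci_plot` draws the figure; `ci` holds the interval endpoints in the columns `lower_ci` and `upper_ci`.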
Packages
We’ll use the tidyverse package for much of the data wrangling and visualization and the tidymodels package for modeling and inference; the data lives in the openintro package. These packages are already installed for you. You can load them by running library(tidyverse), library(tidymodels), and library(openintro) in your Console.
Data
The data can be found in the openintro package, and it’s called births14. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?births14 in the Console or using the Help menu in RStudio to search for births14. You can also find this information here.
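A quick way to confirm that the dataset is available after loading the package might look like the following sketch, which only inspects the data and changes nothing.

```r
library(tidyverse)
library(openintro)

# births14 becomes available once openintro is loaded
glimpse(births14)

# Rows are cases, columns are variables
dim(births14)
```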
Set a seed!
In this lab we’ll be generating random samples. The last thing you want is for those samples to change every time you render your document. So, you should set a seed. There’s an R chunk in your Quarto file set aside for this. Locate it and add a seed. Make sure all members in a team are using the same seed so that you don’t get merge conflicts and your results match up for the narratives.
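The effect of a seed can be seen in a small sketch like the one below. The value 1234 is an arbitrary placeholder; any integer the team agrees on works the same way.

```r
set.seed(1234)  # hypothetical value, not prescribed by the lab
first_draw <- sample(1:100, size = 5)

set.seed(1234)  # resetting the same seed reproduces the same draw
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE
```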
Exercises
Exercise 1
What are the cases in this data set? How many cases are there in our sample?
The first step in the analysis of a new dataset is getting acquainted with the data. Make summaries of the variables in your dataset and determine which variables are categorical and which are numerical. For numerical variables, are there outliers? If you aren’t sure or want to take a closer look at the data, make a graph.
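One possible starting point for these summaries is sketched below. The choice of `habit` and `weight` as the variables to inspect is only an example; the same pattern applies to the other columns.

```r
library(tidyverse)
library(openintro)

# Numerical variables: quartiles, mean, and NA counts
births14 |>
  select(where(is.numeric)) |>
  summary()

# A categorical variable: counts per level, including missing values
births14 |>
  count(habit)

# A boxplot makes potential outliers in a numerical variable visible
weight_plot <- ggplot(births14, aes(x = weight)) +
  geom_boxplot() +
  labs(x = "Birth weight (pounds)")
```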
Exercise 2
A 1995 study suggests that average weight of Caucasian babies born in the US is 3,369 grams (7.43 pounds).¹ In this dataset we only have information on mother’s race, so we will make the simplifying assumption that babies of Caucasian mothers are also Caucasian, i.e. whitemom = "white".
We want to evaluate whether the average weight of Caucasian babies has changed since 1995.
Our null hypothesis should state “there is nothing going on”, i.e. no change since 1995:
\[H_0: \mu = 7.43~\text{pounds}\]
Our alternative hypothesis should reflect the research question, i.e. some change since 1995. Since the research question doesn’t state a direction for the change, we use a two sided alternative hypothesis:
\[H_A: \mu \ne 7.43~\text{pounds}\]
Create a filtered data frame called births14_white that contains data only from white mothers. Then, calculate the mean of the weights of their babies. Run the appropriate hypothesis test, visualize the null distribution, calculate the p-value, and interpret the results in context of the data and the hypothesis test.
For all hypothesis tests in this lab, use \(\alpha = 0.05\).
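A minimal sketch of how this test could be set up with infer follows. It assumes a bootstrap-based null distribution centered at 7.43 pounds; the seed and the number of replicates (1000) are illustrative choices, not values prescribed by the lab.

```r
library(tidyverse)
library(infer)
library(openintro)

set.seed(1234)  # hypothetical seed

births14_white <- births14 |>
  filter(whitemom == "white")

# Observed sample mean
obs_mean <- births14_white |>
  specify(response = weight) |>
  calculate(stat = "mean")

# Null distribution: bootstrap resamples shifted to have mean 7.43
null_dist <- births14_white |>
  specify(response = weight) |>
  hypothesize(null = "point", mu = 7.43) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Null distribution with the two-sided p-value region shaded
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_mean, direction = "two-sided")

p_value <- null_dist |>
  get_p_value(obs_stat = obs_mean, direction = "two-sided")
```

The resulting `p_value` is then compared with the significance level to decide whether to reject the null hypothesis.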
Exercise 3
Consider the possible relationship between a mother’s smoking habit and the weight of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Make side-by-side boxplots displaying the relationship between habit and weight for all birth mothers (i.e. not just white mothers). What does the plot highlight about the relationship between these two variables?
Before moving forward, save a version of the dataset omitting observations where there are NAs for habit. You can call this version births14_habitgiven.
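The plot and the filtered dataset described in this exercise could be produced along these lines. The axis labels are suggestions only.

```r
library(tidyverse)
library(openintro)

# Side-by-side boxplots of birth weight by smoking habit
habit_plot <- ggplot(births14, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Mother's smoking habit", y = "Birth weight (pounds)")

# Keep only observations where habit is recorded
births14_habitgiven <- births14 |>
  filter(!is.na(habit))
```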
Exercise 4
The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions by first grouping the data by the habit variable and then calculating the mean weight in each group:
births14_habitgiven |>
  group_by(habit) |>
  summarize(mean_weight = mean(weight))
There is an observed difference, but is this difference statistically significant? In order to answer this question we will conduct a hypothesis test.
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different. Run the appropriate hypothesis test, calculate the p-value, and interpret the results in context of the data and the hypothesis test.
Construct a 95% confidence interval for the difference between the average weights of babies born to smoking and non-smoking mothers. Would you expect the 95% confidence interval to contain 0? Why or why not?
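One way to approach both parts with infer is sketched below: a permutation test for the difference in means, followed by a bootstrap percentile interval. The seed, the number of replicates, and the subtraction order (nonsmoker minus smoker) are assumptions made for this example.

```r
library(tidyverse)
library(infer)
library(openintro)

set.seed(1234)  # hypothetical seed

births14_habitgiven <- births14 |>
  filter(!is.na(habit))

# Observed difference in mean weight (nonsmoker - smoker)
obs_diff <- births14_habitgiven |>
  specify(weight ~ habit) |>
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Null distribution: shuffle habit labels to break any association
null_dist <- births14_habitgiven |>
  specify(weight ~ habit) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

p_value <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")

# 95% bootstrap percentile interval for the difference in means
boot_ci <- births14_habitgiven |>
  specify(weight ~ habit) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker")) |>
  get_confidence_interval(level = 0.95, type = "percentile")
```

Whether the interval in `boot_ci` covers 0 can then be checked against the conclusion of the hypothesis test.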
Exercise 5
In this portion of the analysis we focus on a possible relationship between a mother’s maturity (mature) and whether or not her baby has a low birth weight (lowbirthweight).
First, a non-inference task: Determine the age cutoff for younger and mature mothers. How are mothers classified as mature or young?
Once you have distinguished between younger and mature mothers, conduct a hypothesis test evaluating whether the proportion of low birth weight babies is higher for mature mothers. State the hypotheses, verify the conditions, run the test and calculate the p-value, and state your conclusion in context of the research question. If you find a significant difference, construct a confidence interval, at the equivalent level to the hypothesis test, for the difference between the proportions of low birth weight babies between mature and younger mothers, and interpret this interval in context of the data.
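The two tasks in this exercise could be sketched as follows. The sketch uses the mother's age column `mage` to look at the age ranges, and a permutation test with a one-sided alternative; the seed, the number of replicates, and the subtraction order (mature minus younger) are illustrative assumptions.

```r
library(tidyverse)
library(infer)
library(openintro)

set.seed(1234)  # hypothetical seed

# Non-inference task: age range within each maturity group
age_ranges <- births14 |>
  group_by(mature) |>
  summarize(
    min_age = min(mage, na.rm = TRUE),
    max_age = max(mage, na.rm = TRUE)
  )

# Observed difference in proportions of low birth weight babies
obs_diff <- births14 |>
  specify(lowbirthweight ~ mature, success = "low") |>
  calculate(stat = "diff in props", order = c("mature mom", "younger mom"))

# Null distribution under independence of the two variables
null_dist <- births14 |>
  specify(lowbirthweight ~ mature, success = "low") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("mature mom", "younger mom"))

# One-sided p-value: is the proportion higher for mature mothers?
p_value <- null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "greater")
```

A one-sided test at a significance level of 0.05 corresponds to a two-sided 90% interval, so `level = 0.90` would be the matching choice in `get_confidence_interval()` if an interval is needed.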
Submission
Once you are finished with the lab, you will submit your final PDF document to Gradescope.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.
This is a team submission. Everyone in the team should contribute to the assignment.
To submit your assignment, have one team member:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2950 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
- Follow the Gradescope instructions to include all your team members on the submission.
Grading
Component | Points |
---|---|
Ex 1 | 5 |
Ex 2 | 10 |
Ex 3 | 10 |
Ex 4 | 10 |
Ex 5 | 10 |
Workflow & formatting | 5 |
Total | 50 |
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Following tidyverse code style
- All code being visible in rendered PDF (lines of no more than 80 characters)
- Appropriate figure sizing, and figures with informative labels and legends
- Ensuring reproducibility by setting a random seed value.
Acknowledgments
- This assignment is derived from Data Science in a Box and licensed under CC BY-SA 4.0.
Footnotes
Wen, Shi Wu, Michael S. Kramer, and Robert H. Usher. “Comparison of birth weight distributions between Chinese and Caucasian infants.” American Journal of Epidemiology 141.12 (1995): 1177-1187.