Project description
Important dates
- Proposal due Wed, Mar 13th
- Data collection and exploratory data analysis due Wed, Mar 27th
- Preregistration of analyses due Wed, Apr 17th
- Draft report due Wed, Apr 24th
- Peer review due Fri, Apr 26th
- Presentation + slides due on Fri, May 3rd
- Final report due on Tue, May 7th
The details will be updated as the project date approaches.
Introduction
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
May be too long, but please do read
This project is designed to give you experience with the full cycle of data science, from collecting observations to modeling to making arguments. Alumni often tell us that the final project was the most useful and memorable part of this class.
It should be something you are proud of and can display as part of a portfolio for job applications. The idea is that you take what you’ve learned through the course (via application exercises, labs, homeworks, exams) and apply it to a specific domain of interest to you. To do well on the final project, simply show us what you’ve learned!
A high-level overview
Pick a topic you find interesting, where there may be some quantitative data to analyze. Think hobbies, past classes you’ve found interesting, something you really care about, etc. It will be a lot easier to dedicate time to the project if you find the topic fundamentally interesting.
To get started, review the curriculum we’ve covered over the semester; it has been designed to give you a sense of a common data science workflow. Apply that workflow to your project:
- Collect data. Find (good) data (which will depend on the research questions you end up formulating). This may take some time, and you may not find exactly the data you want in one file or in one sitting. Your interest may also shift, the more you search and realize what kind of data is available within the topic. Be willing to keep looking for (additional) data and iterating on your topic!
- Explore your data. Start with the most basic types of analyses (summary statistics, histograms, scatterplots, etc.) to get a sense of the data. What could the data tell you? What kind of questions would it fail to answer?
- Write down a few concrete research questions and hypotheses. What have you noticed in exploring your data? What do you know about the processes that generated the data? Discuss these ideas with your collaborators (if working with others), with friends in the course, and with course staff. Do you need to gather any additional data? Slightly different data?
- Select your tools. Which analyses would be most appropriate for the type of data you’ve gathered and the questions you’d like to ask? What types of investigations has this class prepared you to conduct?
- Build models, analyze them, and test your hypotheses. Don’t throw out analyses that fail to show significance; these can still teach you something about the context from which your data came, or the data itself, if interpreted properly.
- Interpret your results. This is so important. Your results don’t matter unless you can make people understand why they matter. What do these results mean for the “real world” beyond your dataset? Were you limited in your conclusions because of issues with the data? What were the issues and how did they specifically impact your analyses?
Rinse and repeat. Doesn’t that look like a nice sequence of tasks? Unfortunately, it never works exactly that way! You will constantly go back and forth between steps. Instead of imagining the steps as 1 ➡️ 2 ➡️ … ➡️ 5, think of them as 1 ↔︎️ 2 ↔︎️ … ↔︎️ 5. Keep trying things until you feel like you’ve put together an interesting story for your final report.
Prepare a focused final report. Pick the clearest and most interesting analyses and contextualize them in your final report. Don’t just present numbers and plots: explain to the reader what they mean. Answer the question “so what?”. Put additional analyses tangential to the final direction of your project in (optional) appendices.
The project is very open-ended. There is no limit on what tools or packages you may use, but we recommend sticking to the packages we learned in class. Neatness, coherence, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible.
You will work on the project with your lab teams.
The three primary deliverables for the final project are:
- A project proposal with three dataset ideas.
- A reproducible project writeup of your analysis, with one required and one optional draft along the way.
- A presentation with slides.
There will be additional submissions throughout the semester to facilitate completion of the final writeup and presentation.
Teams
Projects will be completed in teams of 3-5 students. Every team member should be involved in all aspects of planning and executing the project, and each should make an equal contribution to all parts of it. The scope of your project is based on the number of contributing team members. If you have 4 contributing team members, we will expect a larger project than from a team with 3 contributing team members.
Some lab section meetings will be devoted to work on the project, so all teams will be formed within each lab section (i.e. only students in your lab section can be your team members). The course staff will assign students to teams. To facilitate this process, we will provide a short survey identifying study and communication habits. Once teams are assigned, they cannot be changed.
Team conflicts
Conflict is a healthy part of any team relationship. If your team doesn’t have conflict, then your team members are likely not communicating their issues with each other. Use your team contract (written at the beginning of the project) to help keep your team dynamic healthy.
When you have conflict, you should follow this procedure:
- Refer to the team contract and follow it to address the conflict.
- If you resolve the conflict without issue, great! Otherwise, update the team contract and try to resolve the conflict yourselves.
- If your team is unable to resolve your conflict, please contact info2950@cornell.edu and explain your situation.
- We’ll ask to meet with all the group members and figure out how we can work together to move forward.
Please do not avoid confrontation if you have conflict. If there’s a conflict, the best way to handle it is to bring it into the open and address it.
Project grade adjustments
Remember, do not do the work for a slacking team member. This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Your team will initially receive a final grade assuming that all team members contributed to your project. If you have a 5-person team, but only 3 people contributed, your team will likely receive a lower grade initially because only 3 people’s worth of effort exists for a 5-person project. About a week after the initial project grades are released, adjustments will be made to each individual team member’s group project grade.
We use your project’s Git history (to view the contributions of each team member) and the peer evaluations to adjust each team member’s grade. Adjustments can either increase or decrease your grade, based on each individual’s contributions.
For example, if you have a 4-person team, but only 3 contributing members, the 3 contributing members may have their grades increased to reflect the effort of only 3 contributing members. The non-contributing member will likely have their grade decreased significantly.
Please be patient with the grade adjustments; it takes time to do them fairly. Please know that I, the instructor, handle this entire process myself, and I take it very seriously. If you think your initial group project grade is unfair, please wait for your grade adjustment before you contact us.
The slacking team member
Please do not cover for a slacking/freeloading team member. Please do not do their work for them! This only rewards their bad behavior. Simply leave their work unfinished. (We will not increase your grade during adjustments for doing more than your fair share.)
Remember, we have your Git history. We can see who contributes to the project and who doesn’t. If a team member rarely commits to Git and only makes very small commits, we can see that they did not contribute their fair share.
All students should make their project contributions through their own GitHub account. Do not commit changes to the repository from another team member’s GitHub account. Your Git history should reflect your individual contributions to the project.
Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 3 primary or originally-collected datasets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find datasets on those topics.
Write the proposal in the proposal.qmd file in your project repo.
You must use one of the datasets in the proposal for the final project, unless instructed otherwise when given feedback.
Criteria for datasets
The datasets should meet the following criteria:
- At least 500 observations
- At least 8 columns
- At least 6 of the columns must be useful and unique explanatory variables.
- Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
- If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
- You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
- You may not use data from a secondary data archive. In plainest terms, do not use datasets you find from Kaggle or the UCI Machine Learning Repository. Your data should come from your own collection process (e.g. API or web scraping) or the primary source (e.g. government agency, research group, etc.).
Please ask a member of the course staff if you’re unsure whether your dataset meets the criteria.
If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be okay; use these numbers as guidance for a successful proposal, not as minimum requirements.
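As a rough way to check a candidate dataset against the size guidance above, a minimal sketch in base R follows. The helper `check_dataset_size()` is a hypothetical name introduced here, and the built-in `mtcars` data is only a stand-in for your own data frame. The identifier check is a heuristic and is no substitute for reading the codebook.

```r
# Stand-in data: the built-in `mtcars` data frame (32 rows, 11 columns).
# Replace it with the data frame you are considering for the proposal.
candidate <- mtcars

# Hypothetical helper (not part of any course package): compares the
# dimensions of a data frame with the suggested 500 rows and 8 columns.
check_dataset_size <- function(df, min_rows = 500, min_cols = 8) {
  c(
    enough_rows = nrow(df) >= min_rows,
    enough_cols = ncol(df) >= min_cols
  )
}

size_check <- check_dataset_size(candidate)
print(size_check)

# Heuristic: a column with a distinct value in every row may be an
# identifier, which would not count as a useful explanatory variable.
# Continuous measurements can also be all-distinct, so inspect any hits.
n_distinct <- vapply(candidate, function(col) length(unique(col)), integer(1))
likely_ids <- names(n_distinct)[n_distinct == nrow(candidate)]
print(likely_ids)
```

Checking for redundant columns (such as a state name alongside a state abbreviation) still requires looking at what the variables mean.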
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format; you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- CDC
- Chicago Open Data Portal
- Data.gov
- Data is Plural
- Election Studies
- European Statistics
- FiveThirtyEight
- General Social Survey
- Goodreads
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- National Weather Service
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- Project Gutenberg
- Reddit posts and/or comments
- Sports Reference
- Statistics Canada
- The National Bureau of Economic Research
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
Introduction and data
For each dataset:
- Identify the source of the data.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Write a brief description of the observations.
- Address ethical concerns about the data, if any.
Research question
Your research question should contain at least three variables, and should be a mix of categorical and quantitative variables. When writing a research question, please think about the following:
- What is your target population?
- Is the question original?
- Can the question be answered?
For each dataset, include the following:
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per dataset is required.)
- Statement on why this question is important.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- Identify the types of variables in your research question. Categorical? Quantitative?
Glimpse of data
For each dataset:
- Place the file containing your data in the data folder of the project repo.
- Use the skimr::skim() function to provide a glimpse of the dataset.
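A minimal sketch of the glimpse step, under some loud assumptions: the file name `example.csv` is a placeholder, the built-in `iris` data stands in for your own dataset, and a temporary directory stands in for the data folder of your repo so that the example runs anywhere. The fallback to `str()` is only there for machines where skimr is not installed.

```r
# Assumption: a temporary folder stands in for the repo's data/ folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Placeholder file: write the built-in `iris` data as a stand-in CSV.
example_path <- file.path(data_dir, "example.csv")
write.csv(iris, example_path, row.names = FALSE)

# Read the file back the way you would read your own dataset.
example_data <- read.csv(example_path)

# Summarize every column; fall back to str() if skimr is unavailable.
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(example_data))
} else {
  str(example_data)
}
```

In your proposal, the path would point at the file you placed in the data folder instead of a temporary directory.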
Data collection and exploratory data analysis
Settle on a single idea and state your research question(s) clearly. You will carry out most of your data collection and cleaning, compute some relevant summary statistics, and show some plots of your data as applicable to your research question(s).
Write the data collection and exploratory data analysis in the eda.qmd file in your project repo. It should include the following sections:
- Research question(s). State your research question(s) clearly.
- Data collection and cleaning. Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.1
- Data description. Have an initial draft of your data description section. Your data description should be about your analysis-ready data.
- Data limitations. Identify any potential problems with your dataset.
- Exploratory data analysis. Perform an (initial) exploratory data analysis. This should include summary statistics, visualizations, and any other relevant information about the data. This is where you start to understand your data and identify any initial trends or relationships. You should also identify any potential problems with your dataset.
Thorough EDA requires substantial review and analysis of your data. You should not expect to complete this phase in a single day. You should expect to iterate through 20-30 charts, sets of summary statistics, etc., to get a good understanding of your data.
Visualizations are not expected to look perfect at this point since they are mainly intended for you and your team members. Standard expectations for visualizations (e.g. clearly labeled charts and axes, optimized color palettes) are not necessary at the EDA stage.
- Questions for reviewers. List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.
Preregistration of analyses
Preregister at least two analyses that you vow to present in your final report, no matter the results (significant or not). Do not perform these analyses, check they are significant, and then “preregister” them. That defeats the purpose. You can learn a lot from associations that are not statistically significant!
Keep in mind that for your final report, you must provide at least one model showing patterns or relationships between variables that addresses your research question.
Write your preregistered analyses in the pre-registration.qmd file in your project repo. Make sure to include any questions you have for your project mentor.
Example preregistration
Hypothesis 1
Chocolate-based flavors are more popular on the East Coast than in the Midwest.
Analysis: Run a linear regression where we input region (as a dummy variable) and output proportion of chocolate-based flavors (# chocolate-based scoops / # total scoops) sold on an average day. The Midwest will be our reference category, so we will test whether \(\beta_{\text{East coast}} > 0\).
Hypothesis 2
The cost of ice cream increases with distance from Ithaca, NY.
Analysis: For data where each row represents a different location, we run a linear regression where we input the distance of the location from Ithaca, NY, and output price (in dollars) of ice cream. Because our distance measurement will have signed values (e.g. anything West of Ithaca would be negative miles and East would be positive miles), we will test whether \(\beta_{\text{Distance}} \neq 0\).
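One way to write out the two example analyses above, as a sketch only: the intercepts \(\beta_0\) and \(\alpha_0\), the error terms, the indices, and the variable names are notation assumed here for illustration. \(\text{EastCoast}_i\) equals 1 for an East Coast observation and 0 for a Midwest observation.

\[
\text{PropChocolate}_i = \beta_0 + \beta_{\text{East coast}} \, \text{EastCoast}_i + \varepsilon_i,
\qquad H_0: \beta_{\text{East coast}} = 0 \quad \text{vs.} \quad H_A: \beta_{\text{East coast}} > 0
\]

\[
\text{Price}_j = \alpha_0 + \beta_{\text{Distance}} \, \text{Distance}_j + \varepsilon_j,
\qquad H_0: \beta_{\text{Distance}} = 0 \quad \text{vs.} \quad H_A: \beta_{\text{Distance}} \neq 0
\]

Stating the null and alternative hypotheses next to each model makes it clear in advance which coefficient will be tested and in which direction.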
Draft
The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis, modeling, and initial interpretations.
Write the draft in the report.qmd file in your project repo.
As you work on the draft, you should attempt to complete all the components of the final report, but it is okay to have some incompleteness or partial components. The focus should be on the analysis and less on crafting the final report. Your draft must include a reasonable attempt at each analysis component - data analysis, evaluation of significance, interpretation and conclusions.
Peer review
Critically reviewing others’ work is a crucial part of the scientific process, and INFO 2950 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.
During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. You will provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.
Peer review process and questions are outlined in the relevant lab instructions.
Peer reviews will be graded on the extent to which they comprehensively and constructively address the components of the reviewee’s team’s report. Specifics of peer review grading are also outlined in the relevant lab instructions.
Report
Your written report must be completed in the report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write-up, make sure the printing of code chunks is off with the option echo: false in the YAML.
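As a sketch of what this can look like in the YAML header of report.qmd, the fragment below uses Quarto's document-level `execute` options. The `title` value is a placeholder, and the `warning` and `message` lines are an addition that matches the formatting criterion that warnings and messages be suppressed as well.

```yaml
---
title: "Project title"   # placeholder
execute:
  echo: false      # do not print code chunks in the rendered report
  warning: false   # hide warnings
  message: false   # hide messages
---
```

Options set here apply to every chunk in the document; an individual chunk can still override them if needed.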
The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Report components
Introduction
This section includes an introduction to the project motivation, data, and research question. What is the context of the work? What research question are you trying to answer? What are your main findings? Include a brief summary of your results.
Data description
This should be inspired by the format presented in Gebru et al., 2018. Answer any relevant questions from sections 3.1-3.5 of the Gebru et al. article, especially the following questions:
- What are the observations (rows) and the attributes (columns)?
- Why was this dataset created?
- Who funded the creation of the dataset?
- What processes might have influenced what data was observed and recorded and what was not?
- What preprocessing was done, and how did the data come to be in the form that you are using?
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
Data analysis
Use summary functions like mean and standard deviation along with visual displays like scatterplots and histograms to describe data.
Provide at least one model showing patterns or relationships between variables that addresses your research question. This could be regression or clustering, or something else that measures some property of the dataset.
Evaluation of significance
Use hypothesis tests, simulation, randomization, or any other techniques we have learned to compare the patterns you observe in the dataset to simple randomness.
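To illustrate what comparing a pattern to simple randomness can look like, here is a minimal sketch of a permutation (randomization) test in base R. The data are simulated, the group labels and effect size are assumptions made up for the example, and `diff_in_means()` is a hypothetical helper. It is one possible technique and not necessarily the one your analysis calls for.

```r
set.seed(2950)

# Simulated stand-in data: two hypothetical groups of 50 observations each,
# generated with an assumed one-unit difference in means.
group <- rep(c("A", "B"), each = 50)
outcome <- c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 11, sd = 2))

# Hypothetical helper: difference in group means (B minus A).
diff_in_means <- function(y, g) mean(y[g == "B"]) - mean(y[g == "A"])

observed <- diff_in_means(outcome, group)

# Shuffle the group labels many times to simulate a world in which
# group membership has no association with the outcome.
null_diffs <- replicate(2000, diff_in_means(outcome, sample(group)))

# Two-sided p-value: the share of shuffled differences that are at least
# as extreme as the observed difference.
p_value <- mean(abs(null_diffs) >= abs(observed))

print(observed)
print(p_value)
```

The same shuffling idea extends to other statistics, such as a correlation or a regression slope, by swapping out the helper function.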
Interpretation and conclusions
What did you find over the course of your data analysis, and how confident are you in these conclusions? Detail your results more so than in the introduction, now that the reader is familiar with your methods and analysis. Interpret these results in the wider context of the real-life application from where your data hails.
Limitations
What are the limitations of your study? What are the biases in your data or assumptions of your analyses that specifically affect the conclusions you’re able to draw?
Acknowledgments
Recognize any people or online resources that you found helpful. These can be tutorials, software packages, Stack Overflow questions, peers, and data sources. Showing gratitude is a great way to feel happier! But it also has the nice side-effect of reassuring us that you’re not passing off someone else’s work as your own. Crossover with other courses is permitted and encouraged, but it must be clearly stated, and it must be obvious what parts were and were not done for 2950. Copying without attribution robs you of the chance to learn, and wastes our time investigating.
Appendices
You should submit your appendix(-ces) in the appendices.qmd file in your project repo.
- At minimum, you should have an appendix for your data cleaning. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. When rendered, it should output the dataset you submit as part of your project (e.g. written as a .csv file).
- (Optional) Other appendices. You will almost certainly feel that you have done a lot of work that didn’t end up in the final report. We want you to edit and focus, but we also want to make sure that there’s a place for work that didn’t work out or that didn’t fit in the final presentation. You may include any analyses that you tried but that were tangential to the final direction of your main report. Graders may briefly look at these appendices, but they also may not. You want to make your final report interesting enough that the graders don’t feel the need to look at other things you tried. “Interesting” doesn’t necessarily mean that the results in your final report were all statistically significant; it could be that your results were not significant but you were able to interpret them in an interesting and informed way.
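For the data cleaning appendix described above, the overall shape is raw data in, documented steps, analysis-ready file out. The sketch below uses a tiny made-up data frame, hypothetical column names, and a temporary output path. Your own appendix would read your raw file(s) and write to your project repo instead.

```r
# Hypothetical raw data; in a real appendix this would be read from
# your raw data file(s).
raw <- data.frame(
  State = c("NY", "ny ", "PA", "pa"),
  price_usd = c("4.50", "5.25", "n/a", "3.75"),
  stringsAsFactors = FALSE
)

clean <- raw

# Step 1: use consistent lower-case column names.
names(clean) <- tolower(names(clean))

# Step 2: standardize state codes (trim whitespace, upper-case).
clean$state <- toupper(trimws(clean$state))

# Step 3: convert price to numeric; non-numeric entries such as "n/a"
# become NA (the coercion warning is expected and suppressed).
clean$price_usd <- suppressWarnings(as.numeric(clean$price_usd))

# Step 4: drop rows with missing values.
clean <- clean[complete.cases(clean), ]

# Write the analysis-ready dataset. A temporary path is used here so the
# example runs anywhere.
out_path <- file.path(tempdir(), "analysis_ready.csv")
write.csv(clean, out_path, row.names = FALSE)
```

Keeping each step as its own commented statement makes it easier for a reader to follow how the raw data became the submitted dataset.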
Organization + formatting
While not a separate written section, you will be assessed on the overall presentation and formatting of the written report. A non-exhaustive list of criteria includes:
- The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels.
- Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted.
- All citations and links are properly formatted.
- If there is an appendix, it is reasonably organized and easy for the reader to find relevant information.
- All code, warnings, and messages are suppressed.
- The main body of the written report (not including the appendix) is no longer than 10 pages.
Presentation + slides
Slides
In addition to the written report, your team will also create an oral presentation that summarizes and showcases your project. Using a slide presentation, you will introduce your research question and dataset, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
Your presentation will be created using Quarto, which allows you to write slides using the same reproducible document structure you’re used to.
The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
- Title Slide
- Slide 1: Introduce the topic and motivation
- Slide 2: Introduce the data
- Slide 3: Highlights from EDA
- Slides 4-5: Inference/modeling/other analysis
- Slide 6: Conclusions + future work
Presentation
Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You can choose to present live in class (recommended) or pre-record a video to be shown in class. Either way you must attend the lab session for the Q&A following your presentation.
If you choose to pre-record your presentation, you may use any platform that works best for your group to record your presentation. Below are a few resources on recording videos:
Evaluation
Presentations will be evaluated by the course staff (15 points) and by your peers in your lab section (5 points). Students will receive access to a Google Form where they will provide (confidential) feedback on their peer groups’ presentations. Students will evaluate their own presentations.
Reproducibility + organization
All written work should be reproducible, and the GitHub repo should be neatly organized.
- Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo.
- The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.
Overall grading
The grade breakdown is as follows:
Total | 145 pts |
---|---|
Project proposal | 10 pts |
Data collection and exploratory data analysis | 15 pts |
Preregistration of analyses | 5 pts |
Draft report | 10 pts |
Peer review | 5 pts |
Final report | 70 pts |
Slides + presentation | 15 pts |
Slides + presentation (peer) | 5 pts |
Reproducibility + organization | 10 pts |
Evaluation criteria
Project proposal
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Dataset ideas | Fewer than three dataset ideas are included. Dataset ideas are vague and impossible or excessively difficult to collect. | Three dataset ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with a source cited. | Three dataset ideas are included and all or most datasets could feasibly be collected or accessed by the end of the semester. Each dataset is described alongside a note about availability with (possibly multiple) sources cited. Each dataset could reasonably be part of a data science project, driven by an interesting research question. |
Questions for reviewers | The questions for reviewers are vague or unclear. | The questions for reviewers are specific to the datasets and are based on group discussions between team members. | The questions for reviewers are specific to the datasets and are based on group discussions between team members. Questions for reviewers look toward the next stage of the project. |
Data collection and EDA
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Research question(s) | Question is not clearly stated or significantly limits potential analysis. | Clearly states the research question(s), which have moderate potential for interesting analyses. | Clearly states complex research question(s) that lead to significant potential for interesting analyses. |
Data cleaning | Data is minimally cleaned, with little documentation and description of the steps undertaken. | Completes all necessary data cleaning for subsequent analyses. Describes cleaning steps with some detail. | Completes all necessary data cleaning for subsequent analyses. Describes all cleaning steps in full detail, so that the reader has an excellent grasp of how the raw data was transformed into the analysis-ready dataset. |
Data description | Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
Data limitations | The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results. | Identifies potential harms and data gaps, and describes how these could affect the meaning of results. | Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results, and the impact of results on people. It is evident that significant thought has been put into the limitations of the collected data. |
Exploratory data analysis | Motivation for choice of analysis methods is unclear. Does not justify decisions to either confirm / update research questions and data description. | Sufficient plots (20-30) and summary statistics to identify typical values in single variables and connections between pairs of variables. Uses exploratory analysis to confirm/update research questions and data description. | All expectations of typical projects + analysis methods are carefully chosen to identify important characteristics of data. |
Preregistration of analyses
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Preregistration statement | The preregistered analyses could be performed using the data collected, but it is not clear how they fit in the context of the real-world application from which the data originated. | The preregistered analyses are contextualized by the real-world application to a certain degree. The analyses are not described in a way that persuades the reader their results would be interesting, whether or not they turn out to be statistically significant. | The preregistered analyses reflect deep and critical thinking about the real-world application from which the data originates. The analyses are described in a way that persuades the reader that their results will be interesting, whether or not they turn out to be statistically significant. |
Draft report
Category | Less developed projects | Typical projects | More developed projects |
Data description | Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. |
Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
Data analysis | Analyses selected are not clearly purposeful. Preregistered analyses are not presented. | Analyses selected are purposeful and further the data narrative, but questions raised are not adequately addressed. Preregistered analyses are presented. | All expectations of typical projects + analyses are carefully selected to answer all reasonable questions. Questions raised by one analysis are addressed in subsequent analyses. |
Evaluation of significance | Metrics of statistical significance are present, but not interpreted for the reader and/or relevant to the analysis performed. | Metrics of statistical significance appropriate to the analysis performed are presented and are interpreted to some degree for the reader. | Metrics of statistical significance appropriate to the analysis performed are presented and clearly interpreted for the reader. Limitations of significance metrics are acknowledged. |
Interpretation and conclusions | Results are presented as numeric values and plots, with little to no written discussion. Values are printed out of context, with no/few labels. | Values are interpreted in a way that is clear, addresses what the values mean, and explains to some extent why they are important. Values are printed with clear labels. | Interprets numeric values in a way that supports a clear story and conclusion. Creatively ties the analysis together and presents the results through a well-written discussion. Values are presented in context and with clear labels. |
Peer review
- Peer review issues are opened
- Reviews are constructive, actionable, and sufficiently thorough
Final report
Category | Less developed projects | Typical projects | More developed projects |
Introduction | Less focused and organized. May jump to technical details without explaining why results are important. Research questions are not clearly stated and/or results are not clearly summarized at the end of the introduction. | Provides background information and context. Introduces key terms and data sources. Outlines research question(s). Ends with a brief summary of findings. | All expectations of typical projects + clearly describes why the setting is important and what is at stake in the results of the analysis. Even if the reader doesn’t know much about the subject, they know why they care about the results of your analysis. |
Data description | Simple description of some aspects of the dataset, little consideration for sources. The description is missing answers to applicable questions detailed in the “Datasheets for Datasets” paper. | Answers all relevant questions in the “Datasheets for Datasets” paper. | All expectations of typical projects + credits and values data sources. |
Data analysis | Code closely matches examples from class, and does not go much further. Analyses selected are not clearly purposeful. Preregistered analyses are not presented. | Code goes further than the examples presented in class. Analyses selected are purposeful and further the data narrative, but questions raised are not adequately addressed. Preregistered analyses are presented. | All expectations of typical projects + analyses are carefully selected to answer all reasonable questions. Questions raised by one analysis are addressed in subsequent analyses. |
Evaluation of significance | Metrics of statistical significance are present, but not interpreted for the reader and/or relevant to the analysis performed. | Metrics of statistical significance appropriate to the analysis performed are presented and are interpreted to some degree for the reader. | Metrics of statistical significance appropriate to the analysis performed are presented and clearly interpreted for the reader. Limitations of significance metrics are acknowledged. |
Interpretation and conclusions | Results are presented as numeric values and plots, with little to no written discussion. Values are printed out of context, with no/few labels. | Values are interpreted in a way that is clear, addresses what the values mean, and explains to some extent why they are important. Values are printed with clear labels. | Interprets numeric values in a way that supports a clear story and conclusion. Creatively ties the analysis together and presents the results through a well-written discussion. Values are presented in context and with clear labels. |
Limitations | The limitations are not explained in depth. There is no mention of how these limitations may affect the meaning of results. | Identifies potential harms and data gaps, and describes how these could affect the meaning of results. | Creatively identifies potential harms and data gaps, and describes how these could affect the meaning of results, as well as the impact of results on people. |
Writing | May have spelling and grammatical errors, or awkward or incomplete sentences, indicating that the report was written in haste without editing. | Language is polished and free from errors. | Writing is clear, and complicated ideas are presented such that they are immediately understandable. |
Organization and focus | Work appears to have been done independently by team members and then merged at the last moment. Analyses may be exhaustive but carry little meaning or interpretation. There is not a very clear story throughout the entire report. | Most elements of the project are clear and provide a connected conclusion. Some parts could have been removed to make the report more focused. There is a clear story that flows throughout most of the report. | All elements of the project support a clear and connected conclusion. Every part is essential and cohesive. There is a clear story to the entire report that flows throughout. |
Slides + presentation
Category | Less developed projects | Typical projects | More developed projects |
---|---|---|---|
Time management | Only some members speak during the presentation. Team does not manage time wisely (e.g. runs out of time, finishes early without adequately presenting their project). | All members speak during the presentation. Team does not exceed the five-minute limit. | Team makes full use of their five minutes. Clearly communicates their objectives and outcomes from the project. |
Professionalism | Presentation is slapped together or haphazard. Seems like independent pieces of work patched together. | Presentation appears to be rehearsed. There is cohesion to the presentation. | All elements of typical projects + everyone says something meaningful about the project. |
Slides | Slides contain excessive text and/or content. Team relies too heavily on slides for their presentation. | Slides are well-organized. Slides are used as a tool to assist the oral presentation. | All elements of typical projects + graphics and tables follow best practices (e.g. all text is legible, appropriate use of color and legends). Slides are not crammed full of text. |
Creativity/originality | Project meets the minimum requirements but not much else. Project is incomplete or does not meet the team’s objectives. | Project appears carefully thought out. Time and effort seem to have gone into the planning and implementation of the project. | All elements of typical projects + project goes above and beyond the minimum requirements. Addresses a truly important social issue or noteworthy goal. |
Content | Research question is unclear. Hypotheses and analysis do not clearly address the research question. Limitations are glossed over or ignored entirely. | Research question is stated. Hypotheses and analysis address the research question. Limitations are noted. | Research question is clearly stated. Hypotheses and analysis clearly address the research question. Limitations are carefully considered and articulated. |
Slides + presentation (peer)
Content: Is the project well designed and is the data being used relevant to the focus of the project?
Content: Did the team use appropriate methods and did they interpret them accurately?
Creativity and critical thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
Slides: Are the slides well organized, readable, not full of text, featuring figures with legible labels, legends, etc.?
Professionalism: How well did the team present? Does the presentation appear to be well practiced? Are they reading off of a script? Did everyone get a chance to say something meaningful about the project?
Reproducibility + organization
Category | Less developed projects | Typical projects |
---|---|---|
Reproducibility | Required files are missing. Quarto files do not render successfully (except when a package needs to be installed). | All required files are provided. Quarto files render without issues and reproduce the necessary outputs. |
Data documentation | Codebook is missing. No local copies of data files. | All datasets are stored in a data folder, a codebook is provided, and a local copy of the data file is used in the code where needed. |
File readability | Documents lack a clear structure. There are extraneous materials in the repo and/or files are not clearly organized. | Documents (Quarto files and R scripts) are well structured and easy to follow. No extraneous materials. |
Issues | Issues have been left open, or are closed mostly without specific commits addressing them. | All issues are closed, mostly with specific commits addressing them. |
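The file-related expectations in the table above can be checked before submission with a short script. The following is only a rough sketch, not an official grading tool: the function name `check_repo` is hypothetical, and it assumes that datasets live in a top-level folder named `data` and that the codebook is a file in that folder whose name contains "codebook". It does not render the Quarto files, so rendering still needs to be checked separately.

```python
from pathlib import Path


def check_repo(root):
    """Return a list of problems found in the project repo at `root`.

    Hypothetical helper; the folder and file naming conventions below
    are assumptions, not requirements stated in the rubric.
    """
    root = Path(root)
    problems = []

    # Assumption: all datasets are stored in a top-level "data" folder.
    data_dir = root / "data"
    if not data_dir.is_dir():
        problems.append("missing data/ folder")
    else:
        files = [p for p in data_dir.iterdir() if p.is_file()]
        # Assumption: the codebook has "codebook" somewhere in its file name.
        codebooks = [p for p in files if "codebook" in p.name.lower()]
        datasets = [p for p in files if "codebook" not in p.name.lower()]
        if not datasets:
            problems.append("no local data files in data/")
        if not codebooks:
            problems.append("no codebook found in data/")

    # The repo should contain at least one Quarto document (searched recursively).
    if next(root.rglob("*.qmd"), None) is None:
        problems.append("no .qmd files found")

    return problems
```

Calling `check_repo(".")` from the repository root returns an empty list when all of these checks pass; otherwise each entry names one missing piece.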
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.
Footnotes
If you have written code to collect your data (e.g. using an API or web scraping), store this in a separate .qmd file or .R script in the repo.