Project tips + resources

Modified

March 6, 2024

The project is very open ended. For instance, when creating compelling visualizations of your data in R, there is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations.

Before you finalize your write-up, make sure the printing of code chunks is turned off with the option echo: false. In addition to hiding code chunks, turn off all warnings and messages with the options warning: false and message: false.

Finally, pay attention to details in your write-up and presentation. Neatness, coherency, and clarity will count.

Tips

  • Ask questions if any of the expectations are unclear.

  • Code: In your write-up, your code should be hidden (echo: false) so that your document is neat and easy to read. However, your document should include all your code, so that if I re-render your .qmd file I obtain the results you presented.

    • Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
  • Merge conflicts will happen, issues will arise, and that’s fine! Commit and push often, and ask questions when stuck.

  • Make sure each team member is contributing, both in terms of quality and quantity of contribution (we will be reviewing commits from different team members).

    • All team members are expected to contribute equally to the completion of this assignment, and group assessments will be given at its completion. Anyone judged not to have contributed sufficiently to the final product will have their grade reduced. While different team members may have different backgrounds and abilities, it is the responsibility of every team member to understand how and why all code and approaches in the assignment work.

Producing an advanced final project

Your job as a writer and as a data scientist is to communicate effectively what your dataset is about. The technical details of the dataset will be described in your datasheet (follow a template as in the examples in sections 3.1-3.5 of this article on datasheets). We expect the total length to be 1500-3000 words. Inside this range, length will not be a factor in grading.

Introduction

The introduction should be the exposition of the article where you can use less rigorous language. Your language should be generally accessible. Aim for this to be readable by someone who hasn’t taken this class (maybe your roommate, your family, or you at the start of the semester). It should still be formal, but someone should come to the end and want to read more. FiveThirtyEight articles might be a good baseline tone for this.

Advanced introductions will immediately tell us what the setting is, what you found, and why it matters. They will add details as they are needed. Language will be polished and free from errors (Note: if your group does not include a native English speaker, make a note of that). Beginning writeups will be less focused and organized. They may jump to technical details without explaining why results are important. They may have spelling and grammatical errors, or awkward or incomplete sentences, indicating that they were written in haste and never reviewed.

Datasheet

As described above, write the datasheet in the style of Gebru et al. Think of this as the “origin story” of your dataset. Answer all of the questions listed in the previous section. You can write this in any style as long as it’s easy to read as a Q&A. The datasheet will be graded on content, not style. Follow sections 3.1-3.5 (Motivation to Uses) in this article.

Data analysis and evaluation of significance

Here you will clearly detail your methods used in each part. Qualitative claims made in the exposition should have numerical backing here (instead of “X is larger than Y” write “X is 3.65 times larger than Y”). This should read like a scientific paper, but does not need to be “stuffy” or overly indirect: “we did …” is more natural than “… was done”. A reader should be able to replicate your experiments and findings via their own code after reading this.
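As a rough illustration of giving a qualitative claim numerical backing, the sketch below computes a ratio and turns it into a sentence. It assumes R's built-in mtcars data as a stand-in for a project dataset; the variable names (manual_mpg, automatic_mpg, ratio, claim) are hypothetical.

```r
# Built-in mtcars data stands in for a project dataset (an assumption).
manual_mpg    <- mean(mtcars$mpg[mtcars$am == 1])  # manual transmission
automatic_mpg <- mean(mtcars$mpg[mtcars$am == 0])  # automatic transmission

# Replace "manual cars get better mileage" with a specific number.
ratio <- manual_mpg / automatic_mpg

claim <- sprintf(
  "Manual cars averaged %.2f times the fuel economy of automatic cars (%.1f vs. %.1f mpg).",
  ratio, manual_mpg, automatic_mpg
)
print(claim)
```

Computing the number in code, instead of typing it into the text by hand, keeps the sentence in sync with the data when the report is re-rendered.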

It’s important to organize your analysis. Common organizational patterns:

  • Big to small. Start with a high-level description of the complete dataset, then add more detail and increase specificity until you are looking at individual data points.
  • Small to big. The opposite: start with individual data points, then “zoom out” progressively until you get to a broad, top-level overview.
  • Bites at the apple. Visit different facets of the dataset. This could be subsets of the observations along different criteria, or a series of aggregate views where you are grouping by different variables (e.g. alumni by state, then by industry, then by major).
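The "bites at the apple" pattern might look something like the sketch below, which builds one aggregate view per grouping variable. The alumni data frame and the count_by() helper are hypothetical, and every value is made up for illustration.

```r
# Hypothetical data: every value here is made up for illustration.
alumni <- data.frame(
  state    = c("NY", "NY", "CA", "CA", "NY", "TX"),
  industry = c("Tech", "Finance", "Tech", "Tech", "Health", "Finance"),
  major    = c("Info Sci", "Econ", "CS", "Info Sci", "Biology", "Econ")
)

# Hypothetical helper: count rows per level of one variable,
# sorted from most to least common.
count_by <- function(data, variable) {
  sort(table(data[[variable]]), decreasing = TRUE)
}

# One aggregate view per grouping variable.
by_state    <- count_by(alumni, "state")
by_industry <- count_by(alumni, "industry")
by_major    <- count_by(alumni, "major")

print(by_state)
print(by_industry)
print(by_major)
```

Each view answers a different question about the same observations, which is what makes this pattern useful for organizing a section of the analysis.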

In most cases you will try many possible analyses. You don’t have to report everything that you did. Find a good selection that makes sense. In most datasets there are potentially thousands of different functions that you could analyze. Why are the ones you chose the most interesting?

Advanced analyses will be clear, logical, and methodical. Mathematical modeling will have a clear purpose that answers relevant questions and contributes to an overall perspective. Results will be contextualized with significance tests or comparisons to alternative simpler explanations. Reasonable “next questions” should be followed or acknowledged, though you don’t have to follow every lead. Beginning analyses will be disorganized and haphazard. They will apply models without context or purpose. They will report results without considering whether those results are meaningful or random noise.
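One possible way to compare a result against a simpler explanation is to fit nested models and test whether the extra term helps. The sketch below assumes R's built-in mtcars data as a stand-in for a project dataset and uses an F-test on linear models; this is only one option among many, and the object names are hypothetical.

```r
# Built-in mtcars data stands in for a project dataset (an assumption).
null_model <- lm(mpg ~ 1, data = mtcars)   # simpler explanation: one overall mean
full_model <- lm(mpg ~ wt, data = mtcars)  # proposed explanation: mpg depends on weight

# F-test comparing the two nested models.
comparison <- anova(null_model, full_model)

# The p-value for the added term is in the second row.
p_value <- comparison[["Pr(>F)"]][2]

print(comparison)
print(p_value)
```

A small p-value here suggests the weight term explains more than noise would, but it does not by itself show that the relationship is large or practically important, so report the effect size alongside it.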

Interpretation and conclusions

This section should reflect on what you accomplished and where you might go from here. These can be hard to write without feeling repetitive. The conclusion is a good place to mention things that you tried that did not work, or data that you could not find but that you would add in a hypothetical further version.

Limitations

“Data” is a selective view of the world that may have been produced in any number of limited, skewed, or biased ways. “Models” are exactly that: miniature representations of real processes that capture some essential relationships but eliminate everything else. Identifying the limitations of your work is a critical part of the data science process.

“Do I need to … to get a good grade?”

What we want you to do is make an argument based on a data set. If the perfect data set already exists, great! You have more time to work on the details of the modeling and the presentation. In many cases the data set you want doesn’t exist in the form you are looking for, and you need to do some work to create it. We want you to have tools to do that if needed. But even if you think you have exactly the data you want, you may find that in investigating it you realize that there are additional questions that require more data collection.

Students often find this kind of open-ended project difficult because it requires more independence and feels more risky. We’ve built in multiple low-stakes checkpoints to help make sure we give you feedback and reduce the feeling of “flying blind”. But it’s also the most realistic and valuable experience to prepare you for what you will do after graduation, and the thing that alums most often remember years later.

Writing help

If you could use more support in your writing, the Cornell Writing Centers are a free resource available to students looking to improve their writing. They can help you with specific assignments (like the final project phases!) and they offer their services both in-person and online.

Formatting + communication

Suppress code and warnings

  • Include the following in the YAML of your report.qmd to suppress all code, warnings, and other messages.
execute:
  echo: false
  warning: false
  message: false

Headers

Use headers to clearly label each section. Make sure there is a blank line between the previous line and the header. Use appropriate header levels.

References

Include all references in a section called “References” at the end of the report. This course does not have specific requirements for formatting citations and references. Optional: Use Quarto’s citation support to generate your references. See Citations & Footnotes in the Quarto documentation for more on that.

Appendix

If you have additional work that does not fit or does not belong in the body of the report, you may put it at the end of the document in a section called “Appendix”. The items in the appendix should be properly labeled. The appendix should only be for additional material. The reader should be able to fully understand your report without viewing content in the appendix. We will not grade your appendix.

Resize figures

Resize plots and figures so you have more space for the narrative. Resize individual figures: Set fig-width and fig-height in chunk options, e.g.,

#| label: plot1
#| fig-height: 3
#| fig-width: 5

replacing plot1 with a meaningful label and the height and width with values appropriate for your write-up.

Resize all figures: Include the fig-height and fig-width options in the YAML header as shown below:

execute:
  fig-height: 3
  fig-width: 5

Replace the height and width values with values appropriate for your write-up.

Arranging plots

Arrange plots in a grid, instead of one after the other. This is especially useful when displaying plots for exploratory data analysis and to check assumptions.

If you’re using ggplot2 functions, the patchwork package makes it easy to arrange plots in a grid.
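A minimal sketch of arranging plots with patchwork is shown below. It assumes the ggplot2 and patchwork packages are installed and uses R's built-in mtcars data as a stand-in for a project dataset; the plot names (p1, p2, p3, combined) are hypothetical.

```r
library(ggplot2)
library(patchwork)

# Three example plots; mtcars stands in for a project dataset (an assumption).
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p2 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)
p3 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# `|` places plots side by side; `/` stacks one row on top of another.
# Result: p1 and p2 in the top row, p3 spanning the bottom row.
combined <- (p1 | p2) / p3
combined
```

Because the combined object is rendered as a single figure, the fig-height and fig-width chunk options apply to the whole grid, so a grid usually needs larger values than a single plot.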