Meet the toolkit

Lecture 2

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

January 25, 2024

Announcements

Discussion sections merging
- Sections 201 and 202 (9:05am)
- Sections 205 and 206 (11:15am)
If you cannot access RStudio Workbench yet, email info2950@cornell.edu
Application exercises 00 are ungraded
Complete homework 00 in Canvas

Making INFO 2950 a success

Five tips for success

Complete all the preparation work before class.
Ask questions.
Do the readings.
Do the lab and homework assignments.
Don’t procrastinate and don’t let a week pass by with lingering questions.

Course FAQ

Q - What data science background does this course assume?
A - None! Sort of…

Q - Is this an intro stats course?
A - No. We presume you have already met the prereq and taken one of AEM 2100, BTRY 3010, CEE 3040, ECON 3110, ECON 3130, ENGRD 2700, ILRST 2100, MATH 1710, PAM 2100, PSYCH 2500, SOC 3010, STSCI 2100, STSCI 2150, STSCI 2200 🙄

While statistics $\ne$ data science, they are very closely related and have a tremendous amount of overlap.

Q - Will we be doing computing?
A - Yes! Lots of it.

Course FAQ

Q - Is this an intro CS course?
A - No – you’ve already taken CS 1110 or 1112

Q - What computing language will we learn?
A - R.

Q - Why not Python?
A - Come meet with me during office hours and we can talk about it!

Q - I don’t want to learn R! Will the course use Python again?
A - Probably when someone else teaches the course next year.

Course toolkit

Course operation

Materials: info2950.infosci.cornell.edu
Submission: Gradescope
Discussion: GitHub Discussions
Gradebook: Canvas

Doing data science

Computing:
- R
- RStudio
- tidyverse
- Quarto
Version control and collaboration:
- Git
- GitHub

Toolkit: Computing

Learning goals

By the end of the semester, you will…

Conduct exploratory data analysis through data wrangling and munging as well as visualizations and summary statistics.
Identify patterns in data to make predictions or to identify associations between variables.
Evaluate the strength of patterns using statistical and substantive significance.
Implement data science workflows using common, reproducible methods and software tools.
Use data ethically and responsibly.

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Near-term goals:

Are the tables and figures reproducible from the code and data?
Does the code actually do what you think it does?
In addition to what was done, is it clear why it was done?

Long-term goals:

Can the code be used for other data?
Can you extend the code to do other things?

Toolkit for reproducibility

Scriptability $\rightarrow$ R
Literate programming (code, narrative, output in one place) $\rightarrow$ Quarto
Version control $\rightarrow$ Git / GitHub

R and RStudio

R logo

R is an open-source statistical programming language
R is also an environment for statistical computing and graphics
It’s easily extensible with packages

RStudio logo

RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

R packages

Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data
As of January 2024, there are over 20,000 R packages available on CRAN (the Comprehensive R Archive Network)
We’re going to work with a small (but important) subset of these!

Tour: R and RStudio

A short list (for now) of R essentials

R is a functional language
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
```
do_this(to_this)
do_that(to_this, to_that, with_those)
```
Packages are installed once with the install.packages() function and loaded with the library() function, once per session:
```
install.packages("package_name")
library(package_name)
```
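Unlike in Python, you never rename (alias) packages when you load them:
```
# Python: aliasing a package on import is common
import pandas as pd
```
```
# R: this is NOT valid; library() has no aliasing syntax
library(tidyverse as td)
```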
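As a small illustration of the points above, the sketch below calls a few functions and loads a package. The object names and values are made up for the example, and `stats` is used only because it ships with R, so nothing needs to be installed.

```r
# Call a function on one input: the verb comes first, the input goes in parentheses.
values <- c(1, 2, 4)
average <- mean(values)

# Additional arguments are separated by commas and can be named.
rounded <- round(average, digits = 2)

# Load a package once per session. For a CRAN package you would first run
# install.packages("package_name") a single time on your machine.
library(stats)
middle <- median(values)

print(average)
print(rounded)
print(middle)
```

Note that `library(stats)` takes the bare package name, while `install.packages()` takes it in quotes.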

R essentials (continued)

Columns (variables) in data frames are accessed with $:

dataframe$var_name

Object documentation can be accessed with ?

?mean
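A minimal sketch of both ideas, assuming a small made-up data frame (the `scores` object and its columns are hypothetical, not course data):

```r
# Build a tiny data frame with two columns (variables).
scores <- data.frame(
  student = c("Ana", "Ben", "Chi"),
  points = c(90, 85, 77)
)

# Pull out one column as a vector with $.
points_column <- scores$points

# A column can be passed straight to a function.
average_points <- mean(scores$points)

print(points_column)
print(average_points)

# In an interactive session, ?mean (or equivalently help("mean"))
# opens the documentation page for mean().
```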

R and Python

It is not accurate to say that one programming language is inherently “better” than another, as the choice of language often depends on the specific use case and the individual’s personal preferences. However, R and Python both have their own strengths and weaknesses.

R is particularly well-suited for data analysis and visualization, and it has a large number of libraries and packages specifically designed for these tasks. R’s syntax is also designed to make it easy to manipulate and analyze data.

Python, on the other hand, is a general-purpose programming language that is widely used in a variety of fields, including web development, machine learning, and scientific computing. It has a large and active community that maintains a wide variety of libraries and packages for many different tasks. Python’s simple and easy-to-learn syntax makes it a popular choice for beginners.

Ultimately, the choice between R and Python will depend on the specific task you are trying to accomplish and your personal preferences as a developer. Both languages are powerful and have a lot to offer, and many data scientists use both languages in their work.

Major differences between R and Python

	R	Python
Syntax	Functional language	Object-oriented language
Statistical learning	Developed by statisticians for statistical analysis	Meh
Machine learning	tidymodels, torch, API interfaces to Python	scikit-learn, tensorflow, pytorch
Visualization	ggplot2	matplotlib + others
Package management	CRAN	pip/virtualenv/PyPI/Anaconda
Speed	Somewhat slower	Somewhat faster
Community	Academia and industry	Larger (general-purpose programming language)

tidyverse

tidyverse.org

The tidyverse is an opinionated collection of R packages designed for data science
All packages share an underlying philosophy and a common grammar

Quarto

Fully reproducible documents – each time you render, the analysis is run from the beginning
YAML header to define document settings
Code goes in chunks
Narrative goes outside of chunks
A visual editor for a familiar, Google Docs-like editing experience
Plain-text file format for easy editing and version control
More robust and flexible compared to Jupyter Notebooks

But you can still use the Jupyter engine to run Python natively
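To make the YAML header, narrative, and code chunk structure concrete, here is a rough sketch that writes a minimal Quarto document to a temporary file from R. The title, heading, and chunk contents are invented for illustration; a real course template will look different.

```r
# Lines of a minimal .qmd file: YAML header, narrative, then one R chunk.
qmd_lines <- c(
  "---",
  "title: \"My first Quarto document\"",
  "format: html",
  "---",
  "",
  "## Introduction",
  "",
  "Narrative text goes outside of code chunks.",
  "",
  "```{r}",
  "x <- 2",
  "x * 3",
  "```"
)

# Write the plain-text document to a temporary location.
qmd_path <- tempfile(fileext = ".qmd")
writeLines(qmd_lines, qmd_path)

# Locate the structural pieces: the two YAML delimiters and the chunk opener.
yaml_delimiters <- which(qmd_lines == "---")
chunk_start <- which(qmd_lines == "```{r}")

print(yaml_delimiters)
print(chunk_start)

# If the quarto R package and the Quarto CLI are installed, the file could
# then be rendered with quarto::quarto_render(qmd_path).
```

Because the file is plain text, it can be edited in any editor and tracked line by line in version control.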

Tour: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments

First, run the following in the console:

x <- 2
x * 3

[1] 6

All looks good, eh?

Then, add the following in an R chunk in your Quarto document:

x * 3

Error in eval(expr, envir, enclos): object 'x' not found

What happens? Why the error?
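One way to picture the separation is with two R environment objects standing in for the Console and the rendered document. This is only an analogy under that assumption: rendering actually runs the document's code in its own fresh R session, and the names `console_env` and `document_env` are hypothetical.

```r
# Two separate environments stand in for the Console and the rendered document.
console_env <- new.env()
document_env <- new.env()

# "Run in the console": x is created in the console environment only.
assign("x", 2, envir = console_env)
console_result <- get("x", envir = console_env, inherits = FALSE) * 3

# "Run in the document": x was never defined here, so the lookup fails.
document_result <- tryCatch(
  get("x", envir = document_env, inherits = FALSE) * 3,
  error = function(e) conditionMessage(e)
)

# The fix: define x inside the document itself.
assign("x", 2, envir = document_env)
fixed_result <- get("x", envir = document_env, inherits = FALSE) * 3

print(console_result)
print(document_result)
print(fixed_result)
```

The practical takeaway is that every object a Quarto document uses must be created by code inside that document.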

How will we use Quarto?

Every assignment / report / project / etc. is a Quarto document
You’ll always have a template Quarto document to start with
The amount of scaffolding in the template will decrease over the semester

The Bechdel test

In order to pass the test, a movie must have

👭 At least two named women in it
🗣️ Who talk to each other
🚫 About something besides a man

The Bechdel test

Your turn!

ae-00-bechdel-quarto

Go to ae-00-bechdel-quarto and clone the repo to RStudio Workbench.
Open and render the Quarto document bechdel.qmd, review the document, and fill in the blanks.

Warning

ae-00-bechdel-quarto is hosted on GitHub.com because we have not configured your authentication method for Cornell’s GitHub. We will do this tomorrow in lab.

Toolkit: Version control and collaboration

Git and GitHub

Git logo

Git is a version control system – like the “Track Changes” feature in Microsoft Word, on steroids
It’s not the only version control system, but it’s a very popular one

GitHub logo

GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better
We will use GitHub (Enterprise) as a platform for web hosting and collaboration

Versioning

Versioning with human-readable messages

How we use Git and GitHub

Kevin McCallister in Home Alone cocking an air rifle and saying 'Don't get scared now'.

Git and GitHub tips

There are hundreds of Git commands – you don’t have to know them all. 99% of the time you will use Git to add, commit, push, and pull.
We will be doing Git things and interfacing with GitHub through RStudio, but if you Google for help you might come across methods for doing these things in the command line – skip that and move on to the next resource unless you feel comfortable trying it out.
There is a great resource for working with Git and R: happygitwithr.com. Some of the content in there is beyond the scope of this course, but it’s a good place to look for help.

Tour: Git + GitHub

In lab section
Make sure to access Cornell’s GitHub so I can add you to the course organization on GitHub
