Lecture 2
Cornell University
INFO 2950 - Spring 2024
January 25, 2024
Discussion sections merging
If you cannot access RStudio Workbench yet, email info2950@cornell.edu
Application exercises 00 are ungraded
Complete homework 00 in Canvas
Q - What data science background does this course assume?
A - None! Sort of…
Q - Is this an intro stats course?
A - No. We presume you have already met the prereq and taken one of AEM 2100, BTRY 3010, CEE 3040, ECON 3110, ECON 3130, ENGRD 2700, ILRST 2100, MATH 1710, PAM 2100, PSYCH 2500, SOC 3010, STSCI 2100, STSCI 2150, STSCI 2200 🙄
While statistics \(\ne\) data science, they are very closely related and have a tremendous amount of overlap.
Q - Will we be doing computing?
A - Yes! Lots of it.
Q - Is this an intro CS course?
A - No – you’ve already taken CS 1110 or 1112
Q - What computing language will we learn?
A - R.
Q: Why not Python?
A: Come meet with me during office hours and we can talk about it!
Q: I don’t want to learn R! Will the course use Python again?
A: Probably when someone else teaches the course next year.
Course operation
Doing data science
By the end of the semester, you will…
What does it mean for a data analysis to be “reproducible”?
Near-term goals:
Long-term goals:
Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data.
As of January 2024, there are over 20,000 R packages available on CRAN (the Comprehensive R Archive Network).
We’re going to work with a small (but important) subset of these!
R is a functional language
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
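A minimal sketch of the verb-then-object pattern, using only base R functions. The vector `scores` and its values are made up for illustration.

```r
# verb(noun): apply the function `sum` to a vector of numbers
scores <- c(90, 85, 77)   # hypothetical data
total <- sum(scores)

# Arguments after the first one adjust how the verb is carried out
rounded_mean <- round(mean(scores), digits = 1)

total
rounded_mean
```

Reading each call from the inside out (`mean()` first, then `round()`) mirrors the order in which R evaluates it.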
Packages are installed with the install.packages() function and loaded with the library() function, once per session:
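A hedged sketch of the install-then-load workflow. The install line is commented out because it needs internet access and only has to run once per machine. The `tools` package is used as a stand-in because it ships with R; the package name in the commented line is only an example.

```r
# Install once per machine (example package name; needs internet access):
# install.packages("tidyverse")

# Load once per session. `tools` is bundled with R, so this runs anywhere.
library(tools)

# Once loaded, the package's functions can be called directly.
stem <- file_path_sans_ext("bechdel.qmd")
stem
```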
You never rename packages when you load them:
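As an illustration (again using the bundled `tools` package as a stand-in), `library()` attaches a package under its own name. There is no alias step; functions are called by their bare name or with the `package::function` form.

```r
# The package is attached under its own name; no alias is assigned
library(tools)

# Call a function by its bare name...
ext <- file_ext("bechdel.qmd")

# ...or spell out the package explicitly with `::`
ext_explicit <- tools::file_ext("bechdel.qmd")

identical(ext, ext_explicit)
```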
It is not accurate to say that one programming language is inherently “better” than another, as the choice of language often depends on the specific use case and the individual’s personal preferences. However, R and Python both have their own strengths and weaknesses.
R is particularly well-suited for data analysis and visualization, and it has a large number of libraries and packages specifically designed for these tasks. R’s syntax is also designed to make it easy to manipulate and analyze data.
Python, on the other hand, is a general-purpose programming language that is widely used in a variety of fields, including web development, machine learning, and scientific computing. It has a large and active community that maintains a wide variety of libraries and packages for many different tasks. Python’s simple and easy-to-learn syntax makes it a popular choice for beginners.
Ultimately, the choice between R and Python will depend on the specific task you are trying to accomplish and your personal preferences as a developer. Both languages are powerful and have a lot to offer, and many data scientists use both languages in their work.
| | R | Python |
|---|---|---|
| Syntax | Functional language | Object-oriented language |
| Statistical learning | Developed by statisticians for statistical analysis | Meh |
| Machine learning | | |
| Visualization | ggplot2 | matplotlib + others |
| Package management | CRAN | pip/virtualenv/PyPI/Anaconda |
| Speed | Somewhat slower | Somewhat faster |
| Community | Academia and industry | Larger (general-purpose programming language) |
Fully reproducible documents – each time you render, the analysis is run from the beginning
YAML header to define document settings
Code goes in chunks
Narrative goes outside of chunks
A visual editor for a familiar / Google docs-like editing experience
Plain-text file format for easy editing and version control
More robust and flexible compared to Jupyter Notebooks
But you can still use the Jupyter engine to run Python natively
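To make the three parts of a Quarto document concrete, here is a rough sketch that writes a tiny `.qmd` file from R. The title, the narrative sentence, and the use of the built-in `cars` dataset are placeholders chosen for illustration, not part of the course materials.

```r
# Lines of a minimal Quarto document (contents are illustrative placeholders)
qmd_lines <- c(
  "---",                                  # YAML header opens
  "title: \"Example analysis\"",
  "format: html",
  "---",                                  # YAML header closes
  "",
  "Narrative text goes outside of chunks.",
  "",
  "```{r}",                               # a code chunk opens
  "summary(cars)",
  "```"                                   # the code chunk closes
)

# Write to a temporary plain-text file and print it back
qmd_path <- tempfile(fileext = ".qmd")
writeLines(qmd_lines, qmd_path)
cat(readLines(qmd_path), sep = "\n")
```

Because the file is plain text, it can be edited in any editor and tracked line by line in version control.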
Important
The environment of your Quarto document is separate from the Console!
Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!
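One way to picture the separation is the sketch below, which assumes that `Rscript` is available alongside your R installation. An object created in the current session (standing in for the Console) is not visible to a fresh R session, which is roughly the situation a rendered document is in. The object name `console_only` is hypothetical.

```r
# Defined in the current session, as if typed into the Console
console_only <- 42

# Ask a brand-new R session whether it can see that object
rscript <- file.path(R.home("bin"), "Rscript")
seen_elsewhere <- system2(
  rscript,
  c("--vanilla", "-e", shQuote("cat(exists('console_only'))")),
  stdout = TRUE
)

exists("console_only")  # TRUE here, in the session that created it
seen_elsewhere          # the fresh session reports it cannot find it
```

The practical upshot: anything a document needs (packages, data, objects) has to be created inside the document itself.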
In order to pass the Bechdel test, a movie must have at least two named women in it, who talk to each other, about something besides a man.
ae-00-bechdel-quarto
Go to ae-00-bechdel-quarto and clone the repo to RStudio Workbench.
Open bechdel.qmd, review the document, and fill in the blanks.
Warning
ae-00-bechdel-quarto is hosted on GitHub.com because we have not configured your authentication method for Cornell’s GitHub. We will do this tomorrow in lab.
GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better
We will use GitHub (Enterprise) as a platform for web hosting and collaboration