library(tidyverse)
library(tidymodels)
HW 08 - Predicting Kickstarter funding
This homework is due Tuesday, May 7 at 11:59pm ET. The slip day deadline is Thursday, May 9 at 11:59pm. Regardless of whether you submit on Wednesday or Thursday, it will still count as one slip day.
Getting started
Go to the info2950-sp24 organization on GitHub. Click on the repo with the prefix hw-08. It contains the starter documents you need to complete the homework.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
Workflow + formatting
Make sure to
- Update the author name in your document.
- Label all code chunks informatively and concisely.
- Follow the Tidyverse code style guidelines.
- Make at least 3 commits.
- Resize figures where needed; avoid tiny or huge plots.
- Use informative labels for plot axes, titles, etc.
- Consider aesthetic choices such as color, legend position, etc.
- Turn in an organized, well-formatted document.
You will estimate a series of machine learning models for this homework assignment. I strongly encourage you to make use of code caching in the Quarto document to decrease the rendering time for the document.
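For example, caching an expensive model-fitting chunk takes a single chunk option (a minimal sketch; the chunk label is arbitrary):

```r
#| label: tune-lasso
#| cache: true

# expensive tuning code goes here; Quarto reuses the cached results on
# subsequent renders as long as the chunk's code is unchanged
```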
Data and packages
We’ll use the tidyverse and tidymodels packages for this assignment. You are welcome and encouraged to load additional packages if you desire.
Kickstarter
Kickstarter is an American public benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity.1 Individuals can propose various types of projects on the platform, with specified funding goals and campaign durations. If projects receive enough funding, they are considered “funded” and the project creator receives the funds. If projects do not receive enough funding, they are considered “not funded” and the project creator does not receive the funds.
In this assignment you will estimate a series of machine learning models to predict whether or not a project will be funded based on short “blurbs” written by the project’s creators (typically between 30 and 130 characters in length).2
There are two files in the data folder:
- kickstarter-train.rds - this contains the training set of observations
- kickstarter-test.rds - this contains the test set of observations
We have already split the data into training/test sets for you. You do not need to use initial_split() to partition the data. Unless otherwise specified, all models should be fit using 10-fold cross-validation.
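As a starting point, loading the data and creating the folds might look like this (a sketch; `funded` is an assumed name for the outcome column, so substitute whatever your data actually uses):

```r
library(tidyverse)
library(tidymodels)

# read the pre-split data from the data/ folder
kickstarter_train <- read_rds("data/kickstarter-train.rds")
kickstarter_test <- read_rds("data/kickstarter-test.rds")

# 10-fold cross-validation on the training set, stratified by the outcome
set.seed(123)
kickstarter_folds <- vfold_cv(kickstarter_train, v = 10, strata = funded)
```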
Exercises
Exercise 1
Fit a null model to predict whether or not a project is funded. Report the accuracy and ROC AUC, and interpret them in the context of the model.
When fitting a null model, parsnip doesn't actually use the specified predictor(s) to fit the model. However, you still need to explicitly provide a model formula to the fit() function. Since blurb is a character vector, I recommend using only .id as the "predictor" variable.
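A minimal sketch of this setup, reusing the folds and the assumed `funded` outcome column from above:

```r
# null model: always predicts the majority class, ignoring all inputs
null_spec <- null_model() |>
  set_engine("parsnip") |>
  set_mode("classification")

# .id serves only as a placeholder "predictor" in the formula
null_wf <- workflow() |>
  add_formula(funded ~ .id) |>
  add_model(null_spec)

null_res <- fit_resamples(
  null_wf,
  resamples = kickstarter_folds,
  metrics = metric_set(accuracy, roc_auc)
)
collect_metrics(null_res)
```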
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 2
Fit a naive Bayes model. Use the text description from blurb to fit a naive Bayes model predicting whether or not the project was funded.3
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 5000 most frequently occurring tokens, and calculate tf-idf scores. But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
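A sketch of the minimum recipe and model specification (object and column names are assumptions carried over from the earlier sketches):

```r
library(textrecipes)
library(discrim) # provides naive_Bayes()

# tokenize the blurbs, keep the 5000 most frequent tokens, compute tf-idf
nb_rec <- recipe(funded ~ blurb, data = kickstarter_train) |>
  step_tokenize(blurb) |>
  step_tokenfilter(blurb, max_tokens = 5000) |>
  step_tfidf(blurb)

nb_spec <- naive_Bayes() |>
  set_engine("naivebayes") |>
  set_mode("classification")
```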
Report the accuracy and ROC AUC values for this model, along with the confusion matrix for the predictions. How does this model perform? Which outcome is it more likely to predict incorrectly?
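One way to compute these, continuing the sketch above (save the resampled predictions so a confusion matrix can be built):

```r
nb_wf <- workflow() |>
  add_recipe(nb_rec) |>
  add_model(nb_spec)

nb_res <- fit_resamples(
  nb_wf,
  resamples = kickstarter_folds,
  metrics = metric_set(accuracy, roc_auc),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(nb_res)

# average confusion matrix across the 10 folds
conf_mat_resampled(nb_res, tidy = FALSE) |>
  autoplot(type = "heatmap")
```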
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3
Fit a lasso regression model. Estimate a lasso logistic regression model to predict whether or not the project is funded.
A lasso regression model is a form of penalized regression where the mixture hyperparameter is set to 1.
At minimum, create an appropriate feature engineering recipe to tokenize the data, retain the 500 most frequently occurring tokens, and calculate tf-idf scores.4 But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data.
Tune the model over the penalty hyperparameter using a regular grid of at least 30 values. Check out dials::grid_regular().
Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
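A sketch of this tuning workflow (assuming a tf-idf recipe named `lasso_rec`, built like the naive Bayes recipe above but with `max_tokens = 500`):

```r
# lasso logistic regression: mixture = 1, penalty to be tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")

# regular grid of 30 penalty values on dials' default log10 scale
lasso_grid <- grid_regular(penalty(), levels = 30)

lasso_wf <- workflow() |>
  add_recipe(lasso_rec) |>
  add_model(lasso_spec)

set.seed(123)
lasso_res <- tune_grid(
  lasso_wf,
  resamples = kickstarter_folds,
  grid = lasso_grid
)

show_best(lasso_res, metric = "roc_auc", n = 5)
autoplot(lasso_res)
```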
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 4
Use sparse encoding to improve model fitting efficiency. Review 7.5 Case study: sparse encoding from your preparation materials. Use sparse encoding to improve the efficiency of fitting the lasso regression model from exercise 3.
Perform the same tuning process again using the sparse-encoded dataset, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform compared to the models from exercise 3? What, if anything, did you notice about the runtime?
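Following the approach in that case study, the only required change is a sparse blueprint passed to add_recipe() (a sketch, reusing the exercise 3 objects from above):

```r
library(hardhat)

# keep the tf-idf features as a sparse matrix instead of a dense data frame
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")

sparse_wf <- workflow() |>
  add_recipe(lasso_rec, blueprint = sparse_bp) |>
  add_model(lasso_spec)

set.seed(123)
sparse_res <- tune_grid(
  sparse_wf,
  resamples = kickstarter_folds,
  grid = lasso_grid
)
show_best(sparse_res, metric = "roc_auc", n = 5)
```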
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 5
Build a better recipe. Revise the feature engineering recipe from the lasso model to improve its performance. At minimum, you should:
- Remove stop words
- Stem the tokens
- Calculate all possible 1-grams, 2-grams, and 3-grams
- Generate additional text features using step_textfeature(). This creates a series of numeric features based on the original character strings.
- Ensure all predictors are normalized to the same mean and variance
But you are encouraged to go beyond the minimum (and will likely be rewarded for doing so) by using additional pre-processing steps based on your understanding of the data. A sketch of the minimum recipe follows.
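One way to assemble the minimum steps (a sketch; `better_rec` is a hypothetical name, the 500-token limit is carried over from exercise 3, and step_textfeature() needs a raw character column, so a copy is made before tokenization consumes blurb):

```r
library(textrecipes) # step_stopwords() and step_textfeature() also need
                     # the stopwords and textfeatures packages installed

better_rec <- recipe(funded ~ blurb, data = kickstarter_train) |>
  # copy the raw text so step_textfeature() still has a character column
  step_mutate(blurb_copy = blurb) |>
  step_textfeature(blurb_copy) |>
  step_tokenize(blurb) |>
  step_stopwords(blurb) |>
  step_stem(blurb) |>
  # all possible 1-grams, 2-grams, and 3-grams
  step_ngram(blurb, num_tokens = 3, min_num_tokens = 1) |>
  step_tokenfilter(blurb, max_tokens = 500) |>
  step_tfidf(blurb) |>
  # normalize everything, including the step_textfeature() columns
  step_normalize(all_numeric_predictors())
```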
Tune the model using the cross-validated folds and the glmnet engine, and report the ROC AUC values for the five best models. Use autoplot() to inspect the performance of the models. How do these models perform?
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 6
Fit a model of your own choosing. Select any other model type to predict whether or not the project is funded. You are responsible for implementing appropriate feature engineering and/or hyperparameter tuning for the model.
Briefly summarize how you decided on the workflow for your model (e.g., feature engineering + model specification). How does this model perform compared to the previous models? Report relevant metrics and plots to support your conclusions.
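To illustrate the expected shape of the work, here is one of many possible directions (a sketch, not a recommendation; it reuses the hypothetical exercise 5 recipe and lets tune_grid() pick a space-filling grid and finalize mtry from the data):

```r
# a tuned random forest on the engineered text features
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(better_rec) |>
  add_model(rf_spec)

set.seed(123)
rf_res <- tune_grid(rf_wf, resamples = kickstarter_folds, grid = 10)
show_best(rf_res, metric = "roc_auc")
```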
Credit will be earned based on the effort applied to fitting appropriate models and utilizing the techniques taught in this class. Do the minimum and you can expect to earn minimal credit.
Now is a good time to render, commit (with a descriptive and concise commit message), and push again. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 7
Pick the best performing model. Select a single model to train using the entire training set and provide a brief written justification as to why you chose this specific model.
Fit the recipe + model using the full training set. Report the accuracy and ROC AUC values for this model, along with the ROC curve and confusion matrix for the predictions. How does this model perform? Does it perform equally well for projects that were and were not funded, or does it have a built-in bias towards one specific outcome?
Finally, report the top 20 most relevant features of the model.5 What features were most relevant? Do these features make sense?
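A sketch of this final evaluation, assuming your chosen workflow is stored in `final_wf` and the lasso was selected (the probability column name, here `.pred_funded`, depends on your outcome's factor levels):

```r
# fit on the full training set, then evaluate on the held-out test set
final_fit <- fit(final_wf, data = kickstarter_train)
test_preds <- augment(final_fit, new_data = kickstarter_test)

test_preds |> accuracy(truth = funded, estimate = .pred_class)
test_preds |> roc_auc(truth = funded, .pred_funded)
test_preds |> roc_curve(truth = funded, .pred_funded) |> autoplot()
test_preds |> conf_mat(truth = funded, estimate = .pred_class) |>
  autoplot(type = "heatmap")

# for a lasso model, the largest absolute coefficients are the most
# relevant features; vip::vi() covers many other model types
final_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(term != "(Intercept)") |>
  slice_max(abs(estimate), n = 20)
```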
Render, commit, and push one last time. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Wrap up
Submission
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials → Cornell University NetID and log in using your NetID credentials.
- Click on your INFO 2950 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark the pages associated with each exercise. All the pages of your homework should be associated with at least one question (i.e., should be “checked”).
- Select all pages of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
- Exercise 1: 4 points
- Exercise 2: 8 points
- Exercise 3: 8 points
- Exercise 4: 4 points
- Exercise 5: 8 points
- Exercise 6: 8 points
- Exercise 7: 6 points
- Workflow + formatting: 4 points
- Total: 50 points
The “Workflow & formatting” component assesses the reproducible workflow. This includes:
- Following tidyverse code style
- All code visible in the rendered PDF (lines no longer than 80 characters)
- Appropriate figure sizing, and figures with informative labels and legends
- Ensuring reproducibility by setting a random seed value
Footnotes
1. Source: Wikipedia
2. The dataset comes from Supervised Machine Learning for Text Analysis in R by Emil Hvitfeldt and Julia Silge and has been lightly modified for use in this assignment.
3. Don’t recognize this model type? Look back at the preparation materials for this week.
4. Lasso regression requires all features to be scaled and normalized so they have the same mean and variance. Fortunately for us, by definition tf-idf scores are already normalized, so you do not have to explicitly perform this feature engineering step if all features are tf-idf scores.
5. If the final model uses penalized regression, report the top 20 most relevant features for both outcomes. For all other model types, feature importance is measured the same for both outcomes.