library(tidyverse)
library(tidymodels)
library(probably)
library(arthistory)
library(skimr)
Gender representation in art history textbooks
In this application exercise we will use logistic regression to predict the gender of artists with works included in Gardner’s Art Through the Ages, one of the most widely used art history textbooks in the United States. The data was originally compiled by Holland Stam for an undergraduate thesis.
The dataset is called worksgardner and can be found in the arthistory package. We will use tidyverse and tidymodels for data exploration and modeling, respectively. Please read the following context¹ and skim the data set before we get started.
This dataset contains data that was used for Holland Stam’s thesis work, titled Quantifying art historical narratives. The data was collected to assess the demographic representation of artists through editions of Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system. In this package specifically, both artist-level and work-level data was collected along with variables regarding the artists’ demographics and numeric metrics for describing how much space they or their work took up in each edition of each textbook.
worksgardner: Contains individual work-level data by edition of Gardner’s art history textbook from 1926 until 2020. For each work, there is information about the size of the work as displayed in the textbook as well as the size of the accompanying descriptive text. Demographic data about the artist is also included.
data("worksgardner")
skim(worksgardner)
Prep data for modeling
Before we attempt to model the data, first we will perform some data cleaning.
- Remove rows with NA values or implausible values.
- Convert artist_gender to a factor column.
- Log-transform area_of_work_in_book due to skewness.
- Lump infrequent nationalities into a single “Other” category.
works_subset <- worksgardner |>
  # remove rows with NAs or implausible values
  filter(
    artist_gender %in% c("Male", "Female"),
    area_of_work_in_book > 0,
    artist_race != "N/A",
    artist_ethnicity != "N/A"
  ) |>
  # minor feature engineering
  mutate(
    # convert artist_gender to factor column
    artist_gender = factor(artist_gender, levels = c("Male", "Female")),
    # log-transform area due to skewness
    area_of_work_in_book = log(area_of_work_in_book),
    # lump infrequent nationalities into a single "Other" category
    artist_nationality = fct_lump_n(f = artist_nationality, n = 6)
  ) |>
  # select columns for modeling
  select(
    artist_gender, publication_year, area_of_work_in_book,
    artist_nationality, artist_race, artist_ethnicity
  )

glimpse(works_subset)
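Since the questions below involve sensitivity and specificity, it is worth knowing how imbalanced the outcome is before modeling. A minimal sketch using the cleaned data:

# check the class balance of the outcome
works_subset |>
  count(artist_gender) |>
  mutate(prop = n / sum(n))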
Split data into training/test sets
Demo: Now that we have cleaned the data, let’s split it into training and test sets. We will allocate 80% for training purposes (fitting a model) and 20% for testing purposes (evaluating model performance). Note the use of set.seed()
so that every time we run the code we get the exact same split.
set.seed(123)
# split data into training and test sets
artist_split <- initial_split(data = works_subset, prop = 0.8)

# extract training/test sets as data frames
artist_train <- training(artist_split)
artist_test <- testing(artist_split)
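Because female artists are a minority in this data, a purely random split can leave the test set with few of them. If that is a concern, initial_split() also supports stratified sampling; a sketch (artist_split_strat is a hypothetical name, and the rest of the exercise uses the unstratified split above):

# alternative: stratify the split on the outcome so both sets
# keep a similar share of female artists (not used below)
artist_split_strat <- initial_split(data = works_subset, prop = 0.8, strata = artist_gender)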
Fit a simple logistic regression model
Your turn: Estimate a simple logistic regression model to predict the artist’s gender as a function of when the textbook was published.
# fit the logistic regression model
gender_year_fit <- TODO() |>
  fit(TODO ~ TODO, data = TODO)
tidy(gender_year_fit)
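If you get stuck, one possible completion is sketched below; it assumes parsnip’s logistic_reg(), which uses the glm engine by default.

# one possible completion (sketch): logistic regression with the default glm engine
gender_year_fit <- logistic_reg() |>
  fit(artist_gender ~ publication_year, data = artist_train)

# optional: exponentiate the coefficients to read them as odds ratios
tidy(gender_year_fit, exponentiate = TRUE)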
What do the estimated parameters tell us at this point? Add response here.
Visualize predicted probabilities
Your turn: Generate a plot to visualize the predicted probability that an artist is female based on the model estimated above.
# generate a sequence of years to predict
art_years <- tibble(
  publication_year = seq(
    from = min(works_subset$publication_year),
    to = max(works_subset$publication_year)
  )
)

# generate predicted probabilities
gender_year_pred <- predict(gender_year_fit, new_data = art_years, type = "prob") |>
  bind_cols(art_years)
# visualize predicted probabilities
ggplot(data = gender_year_pred, mapping = aes(x = publication_year, y = .pred_Female)) +
geom_line() +
scale_y_continuous(labels = label_percent()) +
labs(
x = "Publication year",
y = "Predicted probability artist is female"
)
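To sanity-check the fitted curve, you could compare it to the observed share of female artists in each edition; a minimal sketch that aggregates the cleaned data:

# observed share of female artists by publication year
works_subset |>
  group_by(publication_year) |>
  summarize(prop_female = mean(artist_gender == "Female"))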
What are your takeaways? Add response here.
Evaluate model’s performance
Demo: Estimate the model’s performance on the test set using metrics such as accuracy, confusion matrix, sensitivity, and specificity. What do you learn about the model’s performance?
# generate test set predictions
gender_year_pred <- predict(gender_year_fit, artist_test) |>
  bind_cols(artist_test |> select(artist_gender))
# model metrics
conf_mat(gender_year_pred, truth = artist_gender, .pred_class)
accuracy(gender_year_pred, truth = artist_gender, .pred_class)
sensitivity(gender_year_pred, truth = artist_gender, .pred_class)
specificity(gender_year_pred, truth = artist_gender, .pred_class)
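Note that yardstick treats the first factor level (“Male”) as the event by default, so sensitivity() above measures how well male artists are identified. To treat “Female” as the event instead, set event_level; a sketch:

# sensitivity with "Female" (the second factor level) as the event of interest
sensitivity(gender_year_pred, truth = artist_gender, .pred_class, event_level = "second")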
Add response here.
Fit a multiple variable logistic regression model
Your turn: Fit a multiple variable logistic regression model to predict the artist’s gender as a function of all available predictors in the data set.
# fit logistic regression model
gender_many_fit <- TODO() |>
  fit(TODO ~ TODO, data = TODO)
tidy(gender_many_fit)
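One possible completion, sketched under the same assumptions as before (logistic_reg() with the default glm engine; the . shorthand uses every other column in the training data as a predictor):

# one possible completion (sketch): all remaining columns as predictors
gender_many_fit <- logistic_reg() |>
  fit(artist_gender ~ ., data = artist_train)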
Evaluate model’s performance
Your turn: Run the code chunk below and evaluate the full model’s performance. What do the metrics reveal? How useful do you find this model to be?
# bundle multiple metrics together in a single function
multi_metric <- metric_set(accuracy, sensitivity, specificity)
# generate test set predictions
gender_many_pred <- predict(gender_many_fit, artist_test) |>
  bind_cols(artist_test |> select(artist_gender))

gender_many_prob <- predict(gender_many_fit, artist_test, type = "prob") |>
  bind_cols(artist_test |> select(artist_gender))
# model metrics
conf_mat(gender_many_pred, truth = artist_gender, .pred_class)
multi_metric(gender_many_pred, truth = artist_gender, estimate = .pred_class)
# ROC curve and AUC
gender_many_prob |>
  roc_curve(
    # true outcome
    truth = artist_gender,
    # predicted probability artist is female
    .pred_Female,
    # ensure "Female" is treated as the event of interest
    event_level = "second"
  ) |>
  # draw the ROC curve
  autoplot()

gender_many_prob |>
  roc_auc(
    truth = artist_gender,
    .pred_Female,
    event_level = "second"
  )
Add response here.
Adjust prediction threshold
Demo: Instead of attempting to fit a new model, let’s adjust the prediction threshold of the full model to see if we can improve its performance. Test several different prediction threshold values (e.g. any value between 0 and 1) and evaluate the model’s performance at each threshold. What are the trade-offs for different threshold values?
threshold_data <- threshold_perf(
# data frame containing predicted probabilities
.data = gender_many_prob,
# true outcome
truth = artist_gender,
# predicted probability artist is female
estimate = .pred_Female,
# sequence of threshold values to test
thresholds = seq(0, .5, by = 0.0025),
# metrics to calculate
metrics = multi_metric,
# ensure "female" is treated as the event of interest
event_level = "second"
)
ggplot(threshold_data, aes(x = .threshold, y = .estimate, color = .metric)) +
geom_line() +
theme_minimal() +
scale_color_viridis_d(end = 0.9) +
labs(
x = "'Male' Threshold\n(above this value is predicted 'male')",
y = "Metric Estimate",
title = "Balancing performance by varying the threshold"
)
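One way to choose a compromise threshold is to maximize Youden’s J statistic (sensitivity + specificity - 1). The reshaping below is a sketch that assumes the .threshold/.metric/.estimate columns returned by threshold_perf():

# find the threshold that maximizes Youden's J
threshold_data |>
  pivot_wider(id_cols = .threshold, names_from = .metric, values_from = .estimate) |>
  mutate(j_index = sensitivity + specificity - 1) |>
  slice_max(j_index, n = 1)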
Add response here.
Footnotes
¹ Courtesy of #TidyTuesday