Gender representation in art history textbooks

Application exercise

Modified

March 26, 2024

In this application exercise we will use logistic regression to predict the gender of artists with works included in Gardner’s Art Through the Ages, one of the most widely used art history textbooks in the United States. The data was originally compiled by Holland Stam for an undergraduate thesis.

The dataset is called worksgardner and can be found in the arthistory package, and we will use tidyverse and tidymodels for data exploration and modeling, respectively.

library(tidyverse)
library(tidymodels)
library(probably)
library(arthistory)
library(skimr)

Please read the following context¹ and take a skim of the data set before we get started.

This dataset contains data that was used for Holland Stam’s thesis work, titled Quantifying art historical narratives. The data was collected to assess the demographic representation of artists through editions of Janson’s History of Art and Gardner’s Art Through the Ages, two of the most popular art history textbooks used in the American education system. In this package specifically, both artist-level and work-level data was collected along with variables regarding the artists’ demographics and numeric metrics for describing how much space they or their work took up in each edition of each textbook.

worksgardner: Contains individual work-level data by edition of Gardner’s art history textbook from 1926 until 2020. For each work, there is information about the size of the work as displayed in the textbook as well as the size of the accompanying descriptive text. Demographic data about the artist is also included.

data("worksgardner")
skim(worksgardner)

Prep data for modeling

Before we attempt to model the data, first we will perform some data cleaning.

Remove rows with NA values or implausible values.
Convert artist_gender to a factor column.
Log-transform area_of_work_in_book due to skewness.
Lump infrequent nationalities into a single “Other” category.

works_subset <- worksgardner |>
  # remove rows with NAs or implausible values
  filter(
    artist_gender %in% c("Male", "Female"),
    area_of_work_in_book > 0,
    artist_race != "N/A",
    artist_ethnicity != "N/A"
  ) |>
  # minor feature engineering
  mutate(
    # convert artist_gender to factor column
    artist_gender = factor(artist_gender, levels = c("Male", "Female")),
    # log-transform area due to skewness
    area_of_work_in_book = log(area_of_work_in_book),
    # lump infrequent nationalities into a single "Other" category
    artist_nationality = fct_lump_n(f = artist_nationality, n = 6)
  ) |>
  # select columns for modeling
  select(
    artist_gender, publication_year, area_of_work_in_book,
    artist_nationality, artist_race, artist_ethnicity
  )
glimpse(works_subset)

Split data into training/test sets

Demo: Now that we have cleaned the data, let’s split it into training and test sets. We will allocate 80% for training purposes (fitting a model) and 20% for testing purposes (evaluating model performance). Note the use of set.seed() so that every time we run the code we get the exact same split.

set.seed(123)

# split data into training and test sets
artist_split <- initial_split(data = works_subset, prop = 0.8)

# extract training/test sets as data frames
artist_train <- training(artist_split)
artist_test <- testing(artist_split)

Fit a simple logistic regression model

Your turn: Estimate a simple logistic regression model to predict the artist’s gender as a function of when the textbook was published.

# fit the logistic regression model
gender_year_fit <- TODO() |>
  fit(TODO ~ TODO, data = TODO)
tidy(gender_year_fit)

What do the estimated parameters tell us at this point? Add response here.

Visualize predicted probabilities

Your turn: Generate a plot to visualize the predicted probability that an artist is female based on the model estimated above.

# generate a sequence of years to predict
art_years <- tibble(
  publication_year = seq(
    from = min(works_subset$publication_year),
    to = max(works_subset$publication_year)
  )
)

# generate predicted probabilities
gender_year_pred <- predict(gender_year_fit, new_data = art_years, type = "prob") |>
  bind_cols(art_years)

# visualize predicted probabilities
ggplot(data = gender_year_pred, mapping = aes(x = publication_year, y = .pred_Female)) +
  geom_line() +
  scale_y_continuous(labels = label_percent()) +
  labs(
    x = "Publication year",
    y = "Predicted probability artist is female"
  )

What are your takeaways? Add response here.

Evaluate model’s performance

Demo: Estimate the model’s performance on the test set using metrics such as accuracy, confusion matrix, sensitivity, and specificity. What do you learn about the model’s performance?

# generate test set predictions
gender_year_pred <- predict(gender_year_fit, artist_test) |>
  bind_cols(artist_test |> select(artist_gender))

# model metrics
conf_mat(gender_year_pred, truth = artist_gender, .pred_class)

accuracy(gender_year_pred, truth = artist_gender, .pred_class)
sensitivity(gender_year_pred, truth = artist_gender, .pred_class)
specificity(gender_year_pred, truth = artist_gender, .pred_class)

Add response here.

Fit a multiple variable logistic regression model

Your turn: Fit a multiple variable logistic regression model to predict the artist’s gender as a function of all available predictors in the data set.

# fit logistic regression model
gender_many_fit <- TODO() |>
  fit(TODO ~ TODO, data = TODO)
tidy(gender_many_fit)

Evaluate model’s performance

Your turn: Run the code chunk below and evaluate the full model’s performance. What do the metrics reveal? How useful do you find this model to be?

# bundle multiple metrics together in a single function
multi_metric <- metric_set(accuracy, sensitivity, specificity)

# generate test set predictions
gender_many_pred <- predict(gender_many_fit, artist_test) |>
  bind_cols(artist_test |> select(artist_gender))
gender_many_prob <- predict(gender_many_fit, artist_test, type = "prob") |>
  bind_cols(artist_test |> select(artist_gender))

# model metrics
conf_mat(gender_many_pred, truth = artist_gender, .pred_class)
multi_metric(gender_many_pred, truth = artist_gender, estimate = .pred_class)

# ROC curve and AUC
gender_many_prob |>
  roc_curve(
    # true outcome
    truth = artist_gender,
    # predicted probability artist is female
    .pred_Female,
    # ensure "female" is treated as the event of interest
    event_level = "second"
  ) |>
  # draw the ROC curve
  autoplot()

gender_many_prob |>
  roc_auc(
    truth = artist_gender,
    .pred_Female,
    event_level = "second"
  )

Add response here.

Adjust prediction threshold

Demo: Instead of attempting to fit a new model, let’s adjust the prediction threshold of the full model to see if we can improve its performance. Test several different prediction threshold values (e.g. any value between 0 and 1) and evaluate the model’s performance at each threshold. What are the trade-offs for different threshold values?

threshold_perf(
  # data frame containing predicted probabilities
  .data = gender_many_prob,
  # true outcome
  truth = artist_gender,
  # predicted probability artist is female
  estimate = .pred_Female,
  # sequence of threshold values to test
  thresholds = seq(0, .5, by = 0.0025),
  # metrics to calculate
  metrics = multi_metric,
  # ensure "female" is treated as the event of interest
  event_level = "second"
)

ggplot(threshold_data, aes(x = .threshold, y = .estimate, color = .metric)) +
  geom_line() +
  theme_minimal() +
  scale_color_viridis_d(end = 0.9) +
  labs(
    x = "'Male' Threshold\n(above this value is predicted 'male')",
    y = "Metric Estimate",
    title = "Balancing performance by varying the threshold"
  )

Add response here.

Footnotes

Courtesy of #TidyTuesday ↩︎