Unsupervised learning: Dimension reduction

Lecture 25

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

April 25, 2024

Announcements

Peer review in lab
Homework 07

What is unsupervised learning?

Dimension reduction

Transformation of data from a high-dimensional space into a low-dimensional space
Low-dimensional representation retains meaningful properties of the original data

Motivations for dimension reduction

Exploratory analysis and visualization
Resolving multicollinearity in regression features
Dealing with sparse data (curse of dimensionality)

Dimension reduction methods

Linear methods

Principal components analysis (PCA)
Singular value decomposition (SVD)
Linear discriminant analysis (LDA)

Nonlinear methods

Latent semantic analysis (LSA)
Latent Dirichlet allocation (LDA)
t-distributed stochastic neighbor embedding (t-SNE)
Uniform manifold approximation and projection (UMAP)

Applications

Netflix prize

Objective: Build a collaborative filtering algorithm to predict user ratings for films based on previous ratings.
Performance metric: Improve on the existing model by 10% (based on test set RMSE)
Data: 100,480,507 ratings from 480,189 users on 17,770 movies
Prize: $1,000,000
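
The performance metric above can be sketched in a few lines. This is a minimal illustration of test-set RMSE and of relative improvement over a baseline; the ratings and predictions are made up, and the helper names `rmse` and `relative_improvement` are hypothetical, not part of any contest tooling.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse improves on baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Made-up ratings on a 1-5 scale, for illustration only.
actual = [4, 3, 5, 2, 1]
baseline_pred = [3, 3, 4, 3, 2]   # stand-in for the existing model
candidate_pred = [4, 3, 4, 2, 2]  # stand-in for a new model

baseline = rmse(actual, baseline_pred)
candidate = rmse(actual, candidate_pred)
improvement = relative_improvement(baseline, candidate)

print(f"baseline RMSE:  {baseline:.3f}")
print(f"candidate RMSE: {candidate:.3f}")
print(f"improvement:    {improvement:.1%}")
print("meets 10% target:", improvement >= 0.10)
```

Because the target is relative, a 10% improvement means the new RMSE must be at most 90% of the baseline RMSE.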

Winning approach

Singular value decomposition

Measuring legislator ideology

What is political ideology? How do you measure it for a legislator?

Defining ideology on a liberal-conservative spectrum

How do you locate legislators on this spectrum?
How liberal/conservative are members of the same political party?
How do you compare them over time (e.g. how liberal a Democrat was in 1870 vs. 1995)?

DW-NOMINATE

Use roll-call votes to measure ideology
- Over 100,000 recorded votes over time in the U.S. Congress
Dimension reduction identifies a small number of meaningful dimensions of variation across legislators over time
Results in two dimensions:
1. Political ideology
2. “Issue of the day” (e.g. slavery, civil rights, economic policy)

Polarization over time

Principal components analysis

Reduce the dimensionality of the data set through linear combinations of features $X_1, X_2, \ldots, X_p$
Common purposes
- Feature reduction
- Reduce multicollinearity
- Data visualization

Principal components

Explain most of the variability in the data with a smaller number of variables
Low-dimensional representation of high-dimensional data, with minimal information loss
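Keep a small number of dimensions
- Ranked by the amount that the observations vary along each dimension
Each dimension is a linear combination of the $p$ features
Requires standardized features (otherwise features measured on larger scales dominate the variance)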
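
A minimal sketch of what standardizing a feature means, assuming the usual z-score (subtract the mean, divide by the standard deviation). The two columns are hypothetical and chosen only to sit on very different scales; `standardize` is an illustrative helper, not a library function.

```python
import statistics

def standardize(column):
    """Center a feature to mean 0 and scale it to standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample standard deviation (divides by n - 1)
    if sd == 0:
        raise ValueError("a constant feature cannot be standardized")
    return [(x - mean) / sd for x in column]

# Hypothetical features on very different scales.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
income_usd = [20_000.0, 35_000.0, 50_000.0, 65_000.0, 80_000.0]

z_height = standardize(height_cm)
z_income = standardize(income_usd)

print("raw variances:", statistics.variance(height_cm), statistics.variance(income_usd))
print("standardized height:", [round(v, 3) for v in z_height])
print("standardized income:", [round(v, 3) for v in z_income])
```

Before standardizing, the income column would account for nearly all of the total variance simply because of its units. Afterwards each feature has variance 1 and contributes on an equal footing.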

First principal component

Data set $X_1, X_2, \ldots, X_p$

\[Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p\]
Choose the loadings to maximize $\mathrm{Var}(Z_1)$
$\phi_1$ - first principal component loading vector
- $p$ individual loadings $\phi_{11}, \dots, \phi_{p1}$
Loadings are normalized so the variance cannot be inflated by rescaling $\phi_1$

\[\sum_{j=1}^p \phi_{j1}^2 = 1\]
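
One way to see the first principal component in action is power iteration on the covariance matrix, which converges to the unit-length loading vector with the largest variance. This is a sketch under stated assumptions, not how any particular PCA routine is implemented: the data are made up, the features are only centered (the toy columns share a scale; standardize first in practice), and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def first_loading_vector(cov, iterations=500):
    """Power iteration: unit vector phi that maximizes phi' C phi.

    Assumes the start vector is not orthogonal to the answer.
    """
    phi = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        w = matvec(cov, phi)
        norm = math.sqrt(dot(w, w))
        phi = [value / norm for value in w]  # keeps sum of squared loadings at 1
    return phi

# Made-up toy data: 5 observations, 3 features.
rows = [
    [2.0, 1.9, 1.0],
    [1.0, 1.2, 3.0],
    [3.0, 3.1, 0.5],
    [4.0, 3.8, 2.5],
    [5.0, 5.2, 1.5],
]
n, p = len(rows), len(rows[0])
means = [sum(r[j] for r in rows) / n for j in range(p)]
x = [[r[j] - means[j] for j in range(p)] for r in rows]  # centered features
# Covariance with a 1/N divisor, matching the variance formulas in this lecture.
cov = [[sum(r[a] * r[b] for r in x) / n for b in range(p)] for a in range(p)]

phi1 = first_loading_vector(cov)
z1 = [dot(phi1, r) for r in x]  # scores on the first principal component
var_z1 = dot(z1, z1) / n

print("loadings:", [round(v, 3) for v in phi1])
print("sum of squared loadings:", round(dot(phi1, phi1), 6))
print("Var(Z1):", round(var_z1, 4))
print("largest single-feature variance:", round(max(cov[j][j] for j in range(p)), 4))
```

The last two printed values illustrate the point of the maximization: the variance of $Z_1$ is at least as large as the variance of any single feature.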

Second principal component

\[Z_2 = \phi_{12}X_1 + \phi_{22}X_2 + \dots + \phi_{p2}X_p\]

\[\max \mathrm{Var}(Z_2)\]

$\sum_{j=1}^p \phi_{j2}^2 = 1$
$Z_2$ uncorrelated with $Z_1$ (equivalently, loading vectors $\phi_1$ and $\phi_2$ are orthogonal)
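
The constraints on the second component can be checked numerically. The sketch below finds $\phi_1$ by power iteration, removes its contribution from the covariance matrix (a technique usually called deflation, used here for illustration only), and repeats to get $\phi_2$. The data are made up and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def top_eigenpair(m, iterations=500):
    """Power iteration: largest eigenvalue of a symmetric PSD matrix and its unit eigenvector."""
    v = [1.0] + [0.0] * (len(m) - 1)  # assumed not orthogonal to the answer
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [value / norm for value in w]
    return dot(v, matvec(m, v)), v

# Made-up toy data: 5 observations, 3 features.
rows = [
    [2.0, 1.9, 1.0],
    [1.0, 1.2, 3.0],
    [3.0, 3.1, 0.5],
    [4.0, 3.8, 2.5],
    [5.0, 5.2, 1.5],
]
n, p = len(rows), len(rows[0])
means = [sum(r[j] for r in rows) / n for j in range(p)]
x = [[r[j] - means[j] for j in range(p)] for r in rows]  # centered features
cov = [[sum(r[a] * r[b] for r in x) / n for b in range(p)] for a in range(p)]

lam1, phi1 = top_eigenpair(cov)
# Deflation: subtract the variance already captured by the first component.
deflated = [[cov[a][b] - lam1 * phi1[a] * phi1[b] for b in range(p)] for a in range(p)]
lam2, phi2 = top_eigenpair(deflated)

z1 = [dot(phi1, r) for r in x]
z2 = [dot(phi2, r) for r in x]
cov_z1_z2 = dot(z1, z2) / n

print("phi1 . phi2:", round(dot(phi1, phi2), 8))
print("Cov(Z1, Z2):", round(cov_z1_z2, 8))
print("Var(Z1), Var(Z2):", round(lam1, 4), round(lam2, 4))
```

Both printed inner products should be zero up to rounding error, and the second variance should not exceed the first.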

$n$th principal component

At most $\min(n - 1, p)$ distinct principal components for $n$ observations and $p$ centered features
As $j$ increases, the variance of $Z_j$ never increases

Fashion MNIST dataset

# A tibble: 60,000 × 785
   label  pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10
   <fct>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 Pullo…      0      0      0      0      0      0      0      0      0       0
 2 Ankle…      0      0      0      0      0      0      0      0      0       0
 3 Shirt       0      0      0      0      0      0      0      5      0       0
 4 T-shi…      0      0      0      1      2      0      0      0      0       0
 5 Dress       0      0      0      0      0      0      0      0      0       0
 6 Coat        0      0      0      5      4      5      5      3      5       6
 7 Coat        0      0      0      0      0      0      0      0      0       0
 8 Sandal      0      0      0      0      0      0      0      0      0       0
 9 Coat        0      0      0      0      0      0      3      2      0       0
10 Bag         0      0      0      0      0      0      0      0      0       0
# ℹ 59,990 more rows
# ℹ 774 more variables: pixel11 <dbl>, pixel12 <dbl>, pixel13 <dbl>,
#   pixel14 <dbl>, pixel15 <dbl>, pixel16 <dbl>, pixel17 <dbl>, pixel18 <dbl>,
#   pixel19 <dbl>, pixel20 <dbl>, pixel21 <dbl>, pixel22 <dbl>, pixel23 <dbl>,
#   pixel24 <dbl>, pixel25 <dbl>, pixel26 <dbl>, pixel27 <dbl>, pixel28 <dbl>,
#   pixel29 <dbl>, pixel30 <dbl>, pixel31 <dbl>, pixel32 <dbl>, pixel33 <dbl>,
#   pixel34 <dbl>, pixel35 <dbl>, pixel36 <dbl>, pixel37 <dbl>, …

PCA on pixel features

Selecting the number of PCs

Total variance in a dataset (features centered to have mean zero)

\[\sum_{j=1}^p \mathrm{Var}(X_j) = \sum_{j=1}^p \frac{1}{N} \sum_{i=1}^N x_{ij}^2\]

Variance explained by the $m$th principal component

\[\frac{1}{N} \sum_{i=1}^N z_{im}^2 = \frac{1}{N} \sum_{i=1}^N \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2\]

Proportion of variance explained by the $m$th principal component

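\[\text{PVE}_m = \frac{\sum_{i=1}^N \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^p \sum_{i=1}^N x_{ij}^2}\]

Equivalently, with $\lambda_m$ the $m$th largest eigenvalue of the covariance matrix of the features:

\[\text{PVE}_m = \frac{\lambda_m}{\sum_{j=1}^p \lambda_j}\]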
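
The eigenvalue form of PVE makes choosing the number of components a matter of simple arithmetic. The sketch below uses hypothetical eigenvalues and an assumed cumulative-variance threshold of 80%; both are illustrative choices, not values from the Fashion MNIST example, and the helper names are made up.

```python
from itertools import accumulate

def proportion_variance_explained(eigenvalues):
    """PVE of each component: its eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def components_needed(pve, threshold):
    """Smallest number of leading components whose cumulative PVE reaches threshold."""
    for m, cumulative in enumerate(accumulate(pve), start=1):
        if cumulative >= threshold - 1e-12:  # small tolerance for rounding error
            return m
    return len(pve)

# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

pve = proportion_variance_explained(eigenvalues)
cumulative_pve = list(accumulate(pve))

for m, (share, running) in enumerate(zip(pve, cumulative_pve), start=1):
    print(f"PC{m}: PVE = {share:.4f}, cumulative = {running:.4f}")

print("components for 80%:", components_needed(pve, 0.80))
```

With these eigenvalues the first three components carry 87.5% of the total variance, so three components clear the 80% threshold. In practice the cutoff is a judgment call, often made by looking for an elbow in a plot of these values.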

Variance explained

Uniform Manifold Approximation and Projection (UMAP)

PCA is limited to linear combinations of features
UMAP is a nonlinear dimension reduction method
Looks for a low-dimensional representation of the data that preserves the structure of the data as much as possible
Very useful for exploratory analysis (and potentially feature generation)

UMAP on fashion MNIST

Application exercise

`ae-23`

Go to the course GitHub org and find your ae-23 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline – end of tomorrow

Recap

Unsupervised learning is about finding patterns/structure in data
Dimension reduction maps a data set from a high-dimensional space to a low-dimensional space
Dimension reduction is useful for visualization, noise reduction, and feature generation
PCA is a linear dimension reduction technique that finds the directions of maximum variance in the data

If you torture the data long enough…