So far in this course, we’ve considered the supervised learning framework. In the supervised framework, we have a matrix of predictive features \(\mathbf{X}\in \mathbb{R}^{n\times p}\) and a vector of targets \(\mathbf{y}\in \mathbb{R}^n\). Our aim is to find a function \(f:\mathbb{R}^p\rightarrow \mathbb{R}\) such that \(f(\mathbf{x}) \approx y\); that is, \(f\) can be used to make reasonable predictions of a new target \(y\) on the basis of new predictors \(\mathbf{x}\).
Unsupervised learning refers to machine learning techniques in which we do not have access to any targets \(\mathbf{y}\). Instead, we only have the features \(\mathbf{X}\). In general, the aim of unsupervised learning is not to make predictions, but rather to “find some structure” in the features \(\mathbf{X}\).
Dimensionality Reduction
Whereas clustering aims to group together similar observations (rows of \(\mathbf{X}\)), dimensionality reduction is the task of identifying similar or related features (columns of \(\mathbf{X}\)). This often allows us to identify patterns in the data that we wouldn’t be able to spot without algorithmic help. Dimensionality reduction is our topic for this lecture; we’ll discuss clustering in the next one.
Matrix Factorization Methods
There are many approaches for dimensionality reduction. Here, we’re going to consider matrix factorization methods. Broadly speaking, in matrix factorization, we aim to write \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\) for some matrices \(\mathbf{U}\) and \(\mathbf{W}\), usually subject to some constraints on these matrices. One constraint is direct: if \(\mathbf{X}\in \mathbb{R}^{n \times p}\), then we must have \(\mathbf{U}\in \mathbb{R}^{n \times k}\) and \(\mathbf{W}\in \mathbb{R}^{k\times p}\) for some value of \(k\). We sometimes think of \(k\) as a latent dimension for the problem, and aim to find a value of \(k \ll p\). The smaller \(k\) is, the more we have simplified our data, but small values of \(k\) may cause us to lose a lot of interesting structure that we might have wanted to capture.
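To make the bookkeeping concrete, here is a minimal sketch (using made-up random matrices rather than a fitted factorization) of how the dimensions line up:

import numpy as np

n, p, k = 12, 1000, 4      # hypothetical sizes: n observations, p features, latent dimension k
U = np.random.rand(n, k)   # n x k
W = np.random.rand(k, p)   # k x p
(U @ W).shape              # the product has the same shape as X: (12, 1000)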
To specify a matrix factorization model, we need to make three main choices:
What does “\(\approx\)” mean when we write \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\)?
What choice of \(k\) will we use?
What other requirements will we place on \(\mathbf{U}\) and \(\mathbf{W}\)?
Principal Component Analysis
Principal component analysis (PCA) is a classical method in statistics for dimensionality reduction. Let’s answer each of the three questions above for PCA.
Approximation: the Frobenius Norm
In PCA, \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\) means that \(\lVert \mathbf{X}- \mathbf{U}\mathbf{W} \rVert_F\) is small, where \(\lVert \cdot \rVert_F\) is the Frobenius norm. The Frobenius norm of a matrix \(\mathbf{A}\in \mathbb{R}^{n\times p}\) is just the square root of the sum of its squared entries: \[
\lVert \mathbf{A} \rVert_F = \sqrt{\sum_{i = 1}^n \sum_{j = 1}^p a_{ij}^2}\;.
\]
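In code, the Frobenius norm takes only a line or two. Here is a quick check (on a random matrix) that the formula above agrees with numpy’s built-in norm:

import numpy as np

A = np.random.rand(5, 3)
frob_by_hand = np.sqrt((A**2).sum())       # square root of the sum of squared entries
frob_builtin = np.linalg.norm(A, "fro")    # numpy's built-in Frobenius norm
np.isclose(frob_by_hand, frob_builtin)     # True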
Choice of \(k\)
For PCA, we leave the choice of \(k\) to the user. There are heuristic ways for choosing \(k\), but we won’t discuss them in this course.
Constraints on \(\mathbf{U}\) and \(\mathbf{W}\)
In PCA, we place a special structure on the factorization matrices \(\mathbf{U}\) and \(\mathbf{W}\) that helps us to interpret them:
\(\mathbf{W}\in \mathbb{R}^{k \times p}\)
\(\mathbf{U}= \mathbf{X}\mathbf{W}^T\). Note that this ensures that \(\mathbf{U}\in \mathbb{R}^{n\times k}\).
\(\mathbf{W}\mathbf{W}^T = \mathbf{I}\).
These assumptions can be derived statistically from the problem of finding the subspace (spanned by the rows of \(\mathbf{W}\)) that maximizes the variance of the data when projected onto that subspace.
The PCA Model
With these choices, the PCA optimization problem becomes: \[
\hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{W}\;:\;\mathbf{W}\mathbf{W}^T = \mathbf{I}} \lVert \mathbf{X} - \mathbf{X}\mathbf{W}^T\mathbf{W} \rVert_F\;.
\]
Now, we could aim to solve this problem with gradient descent (in the entries of \(\mathbf{W}\)) or some similar approach. This, however, is more complicated than necessary. As it turns out, some nice theory from linear algebra gives us an explicit formula for \(\mathbf{W}\) in terms of the eigenvectors of the matrix \(\mathbf{X}^T\mathbf{X}\). In particular, \[
\hat{\mathbf{W}} = \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_k^T \end{pmatrix}\;,
\] where \(\mathbf{v}_i\) is the eigenvector of \(\mathbf{X}^T\mathbf{X}\) corresponding to its \(i\)th largest eigenvalue. Take a moment to convince yourself that \(\hat{\mathbf{W}}\) has dimensions \(k \times p\) as required.
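Here is a small illustration (on a random symmetric matrix, not on our data) of how the top eigenvectors can be extracted with numpy; the same pattern shows up in the implementation below. Note that np.linalg.eigh returns eigenvalues in ascending order, so the eigenvectors for the largest eigenvalues are the last columns:

import numpy as np

A = np.random.rand(6, 6)
S = A.T @ A                          # symmetric positive semidefinite, like X^T X

eigenvalues, eigenvectors = np.linalg.eigh(S)
eigenvalues                          # sorted from smallest to largest
W_hat = eigenvectors[:, -2:].T       # rows are the eigenvectors for the 2 largest eigenvalues
W_hat.shape                          # (2, 6): plays the role of W with k = 2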
Implementing PCA
Let’s go ahead and implement PCA. We’ll do this using a term-document matrix that I derived from Lewis Carroll’s famous book Alice’s Adventures in Wonderland. Each row of the data corresponds to a chapter of the book, and each column corresponds to a word that may appear in that chapter:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import nltk
from nltk.corpus import gutenberg
# nltk.download('gutenberg') # need to run once
from sklearn.feature_extraction.text import CountVectorizer

# split the raw text of the book into chapters
s = gutenberg.raw("carroll-alice.txt")
chapters = s.split("CHAPTER")[1:]
df = pd.DataFrame({
    "chapter" : range(1, len(chapters) + 1),
    "text"    : chapters
})

# construct the term-document matrix of word counts
vec = CountVectorizer(max_df = 0.5, min_df = 0, stop_words = "english")
counts = vec.fit_transform(df['text'])
counts = counts.toarray()
M = pd.DataFrame(counts, columns = vec.get_feature_names_out())
Now we’re ready to implement PCA. Because there is an explicit solution in terms of the eigenvectors of \(\mathbf{X}^T\mathbf{X}\), our implementation can be very short.
This is a very slow implementation of PCA; faster implementations use numerical routines that compute only the top few singular vectors of \(\mathbf{X}\).
def pca(M, k):
    X = np.array(M)
    ev = np.linalg.eigh(X.T@X)  # special eigensolver for symmetric matrices
    W = ev[1][:,-k:].T          # matrix of top k eigenvectors
    return X@W.T, W             # U and W
Here’s PCA in action:
k = 4
U, W = pca(M, k)
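As a quick sanity check (not part of the original analysis), we can confirm that the factors have the promised structure: the rows of \(\mathbf{W}\) are orthonormal, and \(\mathbf{U}\mathbf{W}\) has the same shape as \(\mathbf{X}\):

U.shape, W.shape                 # ((n_chapters, k), (k, n_terms))
np.allclose(W @ W.T, np.eye(k))  # True: the rows of W are orthonormal
X = np.array(M)
np.linalg.norm(X - U @ W)        # Frobenius norm of the approximation error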
Interpreting PCA
Ok, so why did we do this? Both the matrices \(\mathbf{U}\) and \(\mathbf{W}\) can be used to give us information about the structure of the text. In the context of text analysis, we usually interpret \(\mathbf{U}\) and \(\mathbf{W}\) as expressing information about some latent topics. The \(j\)th row of \(\mathbf{W}\) gives a weight for each term in the vocabulary; this tells us how strongly each term is associated with topic \(j\). The following function can be used to inspect the top words in each topic:
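The original helper isn’t reproduced here, so below is a sketch of what such a function might look like, mirroring the top_words_NMF helper used later and matching how top_words_pca is called in the plotting code below: for a given topic, sort the vocabulary by the weights in the corresponding row of \(\mathbf{W}\) and return the most heavily weighted terms.

def top_words_pca(M, W, component, num_words):
    orders = np.argsort(W, axis = 1)                 # indices that sort each row of W by weight
    important_words = np.array(M.columns)[orders]    # vocabulary, reordered by weight
    return important_words[component][-num_words:]   # most heavily weighted words for this component

for i in range(k):
    print(top_words_pca(M, W, i, 10))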
On the other hand, the matrix \(\mathbf{U}= \mathbf{X}\mathbf{W}^T\) tells us how present each of the \(k\) topics is in each of the documents (in this case, chapters of the book). We can visualize the topic weights in each chapter over time:
fig, ax = plt.subplots(1)
for i in range(k):
    ax.plot(np.arange(1, M.shape[0]+1), U[:,i], label = top_words_pca(M, W, i, 8))
ax.set(xlabel = "chapter", ylabel = "topic weight")
ax.legend(bbox_to_anchor=(1, -.15))
Limitations of PCA
PCA is a fantastically powerful algorithm in a wide variety of settings, but it does have some important limitations. In this case, it can be difficult for us to interpret the results of PCA because some of the topics can be “negatively involved” in a chapter. That is, the words in that topic are “especially absent” from the chapter. This is confusing at best. So, we have a mismatch between what this specific matrix factorization method does and our interpretability needs. This is a very common situation, and is one of the most frequent motivators of new methods.
For more on PCA, including some settings in which it is more effective, see this section of the Python Data Science Handbook by Jake VanderPlas.
Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is an algorithm that is explicitly designed to address this problem. In NMF, we solve a different matrix factorization problem, whose core is: \[
\hat{\mathbf{U}}, \hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{U}\geq \mathbf{0},\; \mathbf{W}\geq \mathbf{0}} \lVert \mathbf{X} - \mathbf{U}\mathbf{W} \rVert_F\;.
\]
I call this the “core” of the problem because implementations usually also minimize additional terms corresponding to different kinds of regularization.
That is, both of the matrices \(\mathbf{U}\) and \(\mathbf{W}\) are required to be nonnegative in all their entries. This makes the results much more interpretable, at the cost of making the problem much harder to solve: implementations of NMF usually rely on modified relatives of gradient descent, like the updates sketched below.
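To give a flavor of what such algorithms look like, here is a minimal sketch of the classic multiplicative-update rules of Lee and Seung for the Frobenius-norm version of the problem. This is for illustration only; scikit-learn’s NMF, which we use below, relies on more sophisticated solvers.

def nmf_multiplicative(X, k, n_iter = 200, eps = 1e-10):
    # illustrative multiplicative updates for minimizing ||X - UW||_F with U, W >= 0
    n, p = X.shape
    rng = np.random.default_rng(1234)
    U = rng.random((n, k))
    W = rng.random((k, p))
    for _ in range(n_iter):
        # each factor is multiplied by a nonnegative ratio, so entries never become negative
        U *= (X @ W.T) / (U @ W @ W.T + eps)
        W *= (U.T @ X) / (U.T @ U @ W + eps)
    return U, W

# example usage (hypothetical): U_mu, W_mu = nmf_multiplicative(np.array(M), k)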
Let’s see NMF in action on our data:
from sklearn.decomposition import NMF
model = NMF(n_components = k)
model.fit(M)
NMF(n_components=4)
As before, we can extract the most important words for each topic:
def top_words_NMF(M, model, component, num_words):
    orders = np.argsort(model.components_, axis = 1)   # sort the vocabulary by weight in each component
    important_words = np.array(M.columns)[orders]
    return important_words[component][-num_words:]

for i in range(k):
    print(top_words_NMF(M, model, i, 10))
U = model.transform(M)

fig, ax = plt.subplots(1)
for i in range(4):
    ax.plot(df['chapter'], U[:,i], label = top_words_NMF(M, model, i, 8))
ax.legend(bbox_to_anchor=(1, -.15))
This is much more interpretable! We can easily see several major features of the plot of the novel, including the tea party with the March Hare, the Mad Hatter, and the Dormouse (Chapter 7), the croquet game in the court of the Queen of Hearts (Chapter 8), the appearance of the Mock Turtle and the Lobster (Chapters 9 and 10), and the reappearance of many characters in Chapter 11.