So far in this course, we’ve considered the supervised learning framework. In the supervised framework, we have a matrix of predictive features \(\mathbf{X}\in \mathbb{R}^{n\times p}\) and a vector of targets \(\mathbf{y}\in \mathbb{R}^n\). Our aim is to find a function \(f:\mathbb{R}^p\rightarrow \mathbb{R}\) such that \(f(\mathbf{x}) \approx y\); that is, \(f\) can be used to make reasonable predictions of a new target \(y\) on the basis of new predictors \(\mathbf{x}\).
Unsupervised learning refers to machine learning techniques in which we do not have access to any targets \(\mathbf{y}\). Instead, we only have the features \(\mathbf{X}\). In general, the aim of unsupervised learning is not to make predictions, but rather to “find some structure” in the features \(\mathbf{X}\).
Dimensionality Reduction
Whereas clustering aims to group together similar observations (rows of \(\mathbf{X}\)), dimensionality reduction is the task of identifying similar or related features (columns of \(\mathbf{X}\)). This often allows us to identify patterns in the data that we wouldn’t be able to spot without algorithmic help. Dimensionality reduction is our topic for this lecture; we’ll discuss clustering in the next one.
Matrix Factorization Methods
There are many approaches for dimensionality reduction. Here, we’re going to consider matrix factorization methods. Broadly speaking, in matrix factorization, we aim to write \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\) for some matrices \(\mathbf{U}\) and \(\mathbf{W}\), usually subject to some constraints on these matrices. One constraint is direct: if \(\mathbf{X}\in \mathbb{R}^{n \times p}\), then we must have \(\mathbf{U}\in \mathbb{R}^{n \times k}\) and \(\mathbf{W}\in \mathbb{R}^{k\times p}\) for some value of \(k\). We sometimes think of \(k\) as a latent dimension for the problem, and aim to find a value of \(k \ll p\). The smaller \(k\) is, the more we have simplified our data, but small values of \(k\) may cause us to lose a lot of interesting structure that we might have wanted to capture.
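To make the bookkeeping concrete, here is a minimal sketch (using made-up random matrices rather than a fitted factorization) of how the dimensions line up:

import numpy as np

n, p, k = 12, 1000, 4      # hypothetical sizes: n observations, p features, latent dimension k
U = np.random.rand(n, k)   # n x k
W = np.random.rand(k, p)   # k x p
(U @ W).shape              # the product has the same shape as X: (12, 1000)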
To specify a matrix factorization model, we need to make three main choices:
What does “\(\approx\)” mean when we write \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\)?
What choice of \(k\) will we use?
What other requirements will we place on \(\mathbf{U}\) and \(\mathbf{W}\)?
Principal Component Analysis
Principal component analysis (PCA) is a classical method in statistics for dimensionality reduction. Let’s answer each of the three questions above for PCA.
Approximation: the Frobenius Norm
In PCA, \(\mathbf{X}\approx \mathbf{U}\mathbf{W}\) means that \(\lVert \mathbf{X}- \mathbf{U}\mathbf{W} \rVert_F\) is small, where \(\lVert \cdot \rVert_F\) is the Frobenius norm. The Frobenius norm of a matrix \(\mathbf{A}\in \mathbb{R}^{n\times p}\) is just the square root of the sum of its squared entries: \[
\lVert \mathbf{A} \rVert_F = \sqrt{\sum_{i = 1}^n \sum_{j = 1}^p a_{ij}^2}\;.
\]
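In code, the Frobenius norm takes only a line or two. Here is a quick check (on a random matrix) that the formula above agrees with numpy’s built-in norm:

import numpy as np

A = np.random.rand(5, 3)
frob_by_hand = np.sqrt((A**2).sum())       # square root of the sum of squared entries
frob_builtin = np.linalg.norm(A, "fro")    # numpy's built-in Frobenius norm
np.isclose(frob_by_hand, frob_builtin)     # True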
Choice of \(k\)
For PCA, we leave the choice of \(k\) to the user. There are heuristic ways for choosing \(k\), but we won’t discuss them in this course.
Constraints on \(\mathbf{U}\) and \(\mathbf{W}\)
In PCA, we place a special structure on the factorization matrices \(\mathbf{U}\) and \(\mathbf{W}\) that helps us to interpret them:
\(\mathbf{W}\in \mathbb{R}^{k \times p}\)
\(\mathbf{U}= \mathbf{X}\mathbf{W}^T\). Note that this ensures that \(\mathbf{U}\in \mathbb{R}^{n\times k}\).
\(\mathbf{W}\mathbf{W}^T = \mathbf{I}\).
These assumptions can be derived statistically from the problem of finding the subspace (spanned by the rows of \(\mathbf{W}\)) that maximizes the variance of the data when projected onto that subspace.
The PCA Model
With these choices, the PCA optimization problem becomes: \[
\hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{W}\;:\;\mathbf{W}\mathbf{W}^T = \mathbf{I}} \lVert \mathbf{X} - \mathbf{X}\mathbf{W}^T\mathbf{W} \rVert_F\;.
\]
Now, we could aim to solve this problem with gradient descent (in the entries of \(\mathbf{W}\)) or some similar approach. This, however, is more complicated than necessary. As it turns out, some nice theory from linear algebra gives us an explicit formula for \(\mathbf{W}\) in terms of the eigenvectors of the matrix \(\mathbf{X}^T\mathbf{X}\). In particular, \[
\hat{\mathbf{W}} = \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_k^T \end{pmatrix}\;,
\] where \(\mathbf{v}_i\) is the eigenvector of \(\mathbf{X}^T\mathbf{X}\) corresponding to its \(i\)th largest eigenvalue. Take a moment to convince yourself that \(\hat{\mathbf{W}}\) has dimensions \(k \times p\) as required.
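Here is a small illustration (on a random symmetric matrix, not on our data) of how the top eigenvectors can be extracted with numpy; the same pattern shows up in the implementation below. Note that np.linalg.eigh returns eigenvalues in ascending order, so the eigenvectors for the largest eigenvalues are the last columns:

import numpy as np

A = np.random.rand(6, 6)
S = A.T @ A                          # symmetric positive semidefinite, like X^T X

eigenvalues, eigenvectors = np.linalg.eigh(S)
eigenvalues                          # sorted from smallest to largest
W_hat = eigenvectors[:, -2:].T       # rows are the eigenvectors for the 2 largest eigenvalues
W_hat.shape                          # (2, 6): plays the role of W with k = 2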
Implementing PCA
Let’s go ahead and implement PCA. We’ll do this using a term-document matrix that I derived from Lewis Carroll’s famous book Alice’s Adventures in Wonderland. Each row of the data corresponds to a chapter of the book, and each column corresponds to a word that may appear in that chapter:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import nltk
from nltk.corpus import gutenberg
# nltk.download('gutenberg') # need to run once
from sklearn.feature_extraction.text import CountVectorizer

# split the raw text of the book into chapters
s = gutenberg.raw("carroll-alice.txt")
chapters = s.split("CHAPTER")[1:]
df = pd.DataFrame({
    "chapter" : range(1, len(chapters) + 1),
    "text"    : chapters
})

# construct the term-document matrix of word counts
vec = CountVectorizer(max_df = 0.5, min_df = 0, stop_words = "english")
counts = vec.fit_transform(df['text'])
counts = counts.toarray()
M = pd.DataFrame(counts, columns = vec.get_feature_names_out())
Now we’re ready to implement PCA. Because there is an explicit solution in terms of the eigenvectors of \(\mathbf{X}^T\mathbf{X}\), our implementation can be very short.
This is a very slow implementation of PCA; faster implementations use numerical routines that compute only the top few singular vectors of \(\mathbf{X}\).
def pca(M, k):
    X = np.array(M)
    ev = np.linalg.eigh(X.T@X)  # special eigensolver for symmetric matrices
    W = ev[1][:,-k:].T          # matrix of top k eigenvectors
    return X@W.T, W             # U and W
Here’s PCA in action:
k = 4
U, W = pca(M, k)
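As a quick sanity check (not part of the original analysis), we can confirm that the factors have the promised structure: the rows of \(\mathbf{W}\) are orthonormal, and \(\mathbf{U}\mathbf{W}\) has the same shape as \(\mathbf{X}\):

U.shape, W.shape                 # ((n_chapters, k), (k, n_terms))
np.allclose(W @ W.T, np.eye(k))  # True: the rows of W are orthonormal
X = np.array(M)
np.linalg.norm(X - U @ W)        # Frobenius norm of the approximation error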
Interpreting PCA
Ok, so why did we do this? Both the matrices \(\mathbf{U}\) and \(\mathbf{W}\) can be used to give us information about the structure of the text. In the context of text analysis, we usually interpret \(\mathbf{U}\) and \(\mathbf{W}\) as expressing information about some latent topics. The \(j\)th row of \(\mathbf{W}\) gives a weight for each term in the vocabulary; this tells us how strongly each term is associated with topic \(j\). The following function can be used to inspect the top words in each topic:
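The original helper isn’t reproduced here, so below is a sketch of what such a function might look like, mirroring the top_words_NMF helper used later and matching how top_words_pca is called in the plotting code below: for a given topic, sort the vocabulary by the weights in the corresponding row of \(\mathbf{W}\) and return the most heavily weighted terms.

def top_words_pca(M, W, component, num_words):
    orders = np.argsort(W, axis = 1)                 # indices that sort each row of W by weight
    important_words = np.array(M.columns)[orders]    # vocabulary, reordered by weight
    return important_words[component][-num_words:]   # most heavily weighted words for this component

for i in range(k):
    print(top_words_pca(M, W, i, 10))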
On the other hand, the matrix \(\mathbf{U}= \mathbf{X}\mathbf{W}^T\) tells us how present each of the \(k\) topics is in each of the documents (in this case, chapters of the book). We can visualize the topic weights in each chapter over time:
fig, ax = plt.subplots(1)
for i in range(k):
    ax.plot(np.arange(1, M.shape[0]+1), U[:,i], label = top_words_pca(M, W, i, 8))
ax.set(xlabel = "chapter", ylabel = "topic weight")
ax.legend(bbox_to_anchor=(1, -.15))
Limitations of PCA
PCA is a fantastically powerful algorithm in a wide variety of settings, but it does have some important limitations. In this case, it can be difficult for us to interpret the results of PCA because some of the topics can be “negatively involved” in a chapter. That is, the words in that topic are “especially absent” from the chapter. This is confusing at best. So, we have a mismatch between what this specific matrix factorization method does and our interpretability needs. This is a very common situation, and is one of the most frequent motivators of new methods.
For more on PCA, including some settings in which it is more effective, see this section of the Python Data Science Handbook by Jake VanderPlas.
Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is an algorithm that is explicitly designed to address this problem. In NMF, we solve a different matrix factorization problem, whose core is: \[
\hat{\mathbf{U}}, \hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{U}\geq \mathbf{0},\; \mathbf{W}\geq \mathbf{0}} \lVert \mathbf{X} - \mathbf{U}\mathbf{W} \rVert_F\;.
\]
I call this the “core” of the problem because implementations usually also minimize additional terms corresponding to different kinds of regularization.
That is, both of the matrices \(\mathbf{U}\) and \(\mathbf{W}\) are required to be nonnegative in all their entries. This makes the results much more interpretable, at the cost of making the problem much harder to solve: implementations of NMF usually rely on modified relatives of gradient descent, like the updates sketched below.
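To give a flavor of what such algorithms look like, here is a minimal sketch of the classic multiplicative-update rules of Lee and Seung for the Frobenius-norm version of the problem. This is for illustration only; scikit-learn’s NMF, which we use below, relies on more sophisticated solvers.

def nmf_multiplicative(X, k, n_iter = 200, eps = 1e-10):
    # illustrative multiplicative updates for minimizing ||X - UW||_F with U, W >= 0
    n, p = X.shape
    rng = np.random.default_rng(1234)
    U = rng.random((n, k))
    W = rng.random((k, p))
    for _ in range(n_iter):
        # each factor is multiplied by a nonnegative ratio, so entries never become negative
        U *= (X @ W.T) / (U @ W @ W.T + eps)
        W *= (U.T @ X) / (U.T @ U @ W + eps)
    return U, W

# example usage (hypothetical): U_mu, W_mu = nmf_multiplicative(np.array(M), k)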
Let’s see NMF in action on our data:
from sklearn.decomposition import NMF
model = NMF(n_components = k)
model.fit(M)
NMF(n_components=4)
As before, we can extract the most important words for each topic:
def top_words_NMF(M, model, component, num_words):
    orders = np.argsort(model.components_, axis = 1)   # sort the vocabulary by weight in each component
    important_words = np.array(M.columns)[orders]
    return important_words[component][-num_words:]

for i in range(k):
    print(top_words_NMF(M, model, i, 10))
U = model.transform(M)

fig, ax = plt.subplots(1)
for i in range(4):
    ax.plot(df['chapter'], U[:,i], label = top_words_NMF(M, model, i, 8))
ax.legend(bbox_to_anchor=(1, -.15))
This is much more interpretable! We can easily see several major features of the plot of the novel, including the tea party with the March Hare, the Mad Hatter, and the Dormouse (Chapter 7), the croquet game in the court of the Queen of Hearts (Chapter 8), the appearance of the Mock Turtle and the Lobster (Chapters 9 and 10), and the reappearance of many characters in Chapter 11.