from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt
import numpy as np
np.random.seed(12345)

fig, ax = plt.subplots(1, figsize = (4, 4))
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=1)

a = ax.scatter(X[:, 0], X[:, 1])
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
Clustering Data
Clustering is a fundamental form of unsupervised learning in which the aim is to group data points into “similar” or “related” groups. In this way, it is superficially similar to classification. Suppose we have some data \(\mathbf{X}\) that, when we plot it, looks like this:
This data looks a lot like the kind of data that we used for classification. This time, however, we ignore \(\mathbf{y}\) (imagine our data didn’t come with any labels). We can still look at the plot and see that it apparently contains two groups or “clusters” of data. The clustering task is to identify clusters that “fit” the data well, according to some criterion of “fit.” The k-means algorithm is one algorithm that attempts to perform this task. On this particular data, k-means does pretty well:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 2)

# NOTE: fit does not use y!!
km.fit(X)
clusters = km.predict(X)

fig, ax = plt.subplots(1, figsize = (4, 4))
a = ax.scatter(X[:, 0], X[:, 1], c = clusters, cmap = plt.cm.cividis)
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
Note that what “pretty well” means here is subjective; because we don’t have any labels \(\mathbf{y}\), we can’t say that the clustering found by the model is “good” according to any objective criterion. Often we just have to eyeball the groups, or attempt to interpret them based on some prior knowledge we might have about the data.
K-Means Clustering
In the k-means algorithm, we divide the data into clusters by finding a cluster centroid for each. The centroid is the mean feature value in each cluster, and gives a summary of the location of the cluster. Suppose we have \(\ell\) clusters, and let \(\mathbf{m}_j \in \mathbb{R}^p\), \(j = 1,\ldots,\ell\) be the proposed cluster centroids, which we can collect into a matrix \(\mathbf{M}\in \mathbb{R}^{\ell \times p}\). To associate each data point \(\mathbf{x}_i\) to a cluster centroid \(\mathbf{m}_j\), we also posit a vector \(\mathbf{z}\in [\ell]^n\) of cluster labels, one for each point. If \(z_i = j\), this says that “point \(i\) belongs to cluster \(j\).”
Our aim is to find the cluster centroids \(\mathbf{m}_j\), \(j = 1,\ldots, \ell\) and labels \(\mathbf{z}\) that separate the data “well.” What does “well” mean? As usual, in order to define this, we need to define a loss function. We’re outside our usual supervised framework, so we need our loss function to come from a different place. The standard loss function for k-means comes from a simple intuition: a good clustering is one in which the points in each group are close to their group centroid. This leads us to the problem
\[ \hat{\mathbf{M}}, \hat{\mathbf{z}} = \mathop{\mathrm{arg\,min}}_{\mathbf{M},\;\mathbf{z}} \sum_{i = 1}^n \lVert \mathbf{x}_i - \mathbf{m}_{z_i} \rVert^2\;. \tag{1}\]
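To make this objective concrete, here is a minimal sketch (not part of the original code; the name k_means_loss is chosen here for illustration) that evaluates the loss in Equation 1 for a given centroid matrix M and label vector z:

# minimal sketch (not in the original notes): evaluate the k-means loss from Equation 1
def k_means_loss(X, M, z):
    # squared distance from each point to its assigned centroid, summed over all points
    return ((X - M[z])**2).sum()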
Equation 1 is a nice mathematical formulation with a deep, dark secret: it’s not convex! This means that we can’t expect to find the “best” \(\hat{\mathbf{M}}\) and \(\hat{\mathbf{z}}\); we can only find “pretty good ones,” where “pretty good” means “locally optimal.” There’s a beautifully simple algorithm that we can use to perform this task:
import numpy as np
from sklearn.metrics import pairwise_distances
def k_means_step(X, M, z):
    # compute the distances between all points and all centroids
    D = pairwise_distances(X, M)

    # each point's new group is its closest centroid
    z = np.argmin(D, axis = 1)

    # each centroid's new value is the mean of all the points in its group
    for j in range(M.shape[0]):
        M[j,:] = X[z == j].mean(axis = 0)

    return M, z

def k_means(X, ell):
    # random initialization
    n, p = X.shape
    M = np.random.rand(ell, p)
    z_prev = np.random.randint(0, ell, n)
    done = False

    # do k_means_step until z stops changing
    while not done:
        M, z = k_means_step(X, M, z_prev)
        if np.all(z_prev == z):
            done = True
        z_prev = z

    # return the centroid matrix and cluster labels
    return M, z
Let’s run this algorithm and plot the results.
M, z = k_means(X, 2)

fig, ax = plt.subplots(1, figsize = (4, 4))
a = ax.scatter(X[:, 0], X[:, 1], c = z, alpha = 0.4, cmap = plt.cm.cividis)
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
ax.scatter(M[:,0], M[:,1], s = 50, color = "black")
Looks ok! We have segmented the data into two clusters and found reasonable looking centroids, even though we didn’t begin with any true labels.
The following result is the fundamental theorem for k-means: each iteration of the algorithm never increases the objective in Equation 1, and since there are only finitely many ways to assign points to clusters, the algorithm terminates after a finite number of iterations at a locally optimal clustering.
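As an informal illustration of this result (not part of the original notes), we can reuse k_means_step together with the k_means_loss sketch from above and watch the loss fall; initializing the centroids at randomly chosen data points is an assumption made here for the illustration:

# informal check (not in the original notes): the loss from Equation 1 should never increase
n, p = X.shape
M = X[np.random.choice(n, 2, replace = False)].astype(float)   # illustrative initialization
z = np.random.randint(0, 2, n)
for _ in range(5):
    M, z = k_means_step(X, M, z)
    print(k_means_loss(X, M, z))   # non-increasing sequence of loss values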
Laplacian Spectral Clustering
K-means is a useful algorithm with an important limitation: it works best when the data is approximately linearly separable! So, we wouldn’t expect k-means to do very well at all if we wanted to separate the two rings in a data set like this:
np.random.seed(1234)

n = 500
X, y = make_circles(n_samples=n, shuffle=True, noise=0.07, random_state=None, factor = 0.5)

fig, ax = plt.subplots(1, figsize = (4, 4))
a = ax.scatter(X[:, 0], X[:, 1])
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
Let’s check:
M, z = k_means(X, 2)

fig, ax = plt.subplots(1, figsize = (4, 4))
a = ax.scatter(X[:, 0], X[:, 1], c = z, alpha = 0.4, cmap = plt.cm.cividis)
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
ax.scatter(M[:,0], M[:,1], s = 50, color = "black")
You can see what k-means was going for here, but the result doesn’t distinguish the two rings that most of us would pick out by eye.
We’ll now develop Laplacian spectral clustering, a clustering method that uses matrix eigenvectors to cluster the data. Spectral clustering is a very beautiful and effective technique for nonlinear clustering of data. It has two primary limitations: it is most effective for binary clustering, and it can be computationally expensive for larger data sets.
Nearest-Neighbors Graph
Laplacian clustering begins by computing the \(k\)-nearest-neighbors graph. In the \(k\)-nearest-neighbors graph, we draw a connecting edge between each point and the \(k\) points that are nearest to it in distance.
from sklearn.neighbors import NearestNeighbors
k = 10
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
A = nbrs.kneighbors_graph().toarray()

# symmetrize the matrix
A = A + A.T
A[A > 1] = 1
Here, the ith row of A contains a 1 in each column j such that j is one of the k nearest neighbors of i. The following function will draw this graph for us:
import networkx as nx
def plot_graph(X, A, z = None, ax = None, show_edge_cuts = True):
    G = nx.from_numpy_array(A)
    if z is None:
        nx.draw_networkx(G, pos = X, alpha = .4, node_color = "grey", node_size = 20, ax = ax)
    else:
        if show_edge_cuts:
            colors = ["red" if z[i] != z[j] else "grey" for i, j in G.edges()]
            widths = [2 if z[i] != z[j] else 1 for i, j in G.edges()]
        else:
            colors = "black"
            widths = 1

        nx.draw_networkx(G, pos = X, alpha = .4, node_color = z, node_size = 20, edge_color = colors, width = widths, ax = ax, cmap=plt.cm.cividis)

    plt.gca().set(xlabel = "Feature 1", ylabel = "Feature 2")

fig, ax = plt.subplots(figsize = (4, 4))
plot_graph(X, A)
We can now frame our task as one of finding the “pieces” of this graph. Defining a “piece” of a graph mathematically takes a bit of work. To get started, let’s first define a binary label vector \(\mathbf{z}\in \{0,1\}^n\). As usual, our machine learning problem is actually a mathematical minimization problem: we want to define an objective function \(f\) such that \(f(\mathbf{z})\) is small when the labels of \(\mathbf{z}\) are “good.”
Cut-Based Clustering
A reasonable first pass at this problem is to define a clustering to be good when it doesn’t “cut” too many edges. An edge is “cut” if it has two nodes in different clusters. For example, here are two possible clusterings, with the edges cut by each clustering shown in red.
fig, axarr = plt.subplots(1, 2, figsize = (8, 4))
y_bad = np.random.randint(0, 2, n)

plot_graph(X, A, z = y, ax = axarr[0])
plot_graph(X, A, z = y_bad, ax = axarr[1])
The righthand plot has many more red edges, which connect nodes in two different proposed clusters. In contrast, the lefthand plot has many fewer.
Here are the cut values of these two clusterings:
from sklearn.neighbors import NearestNeighbors
def cut(A, z):
    # D[i, j] = |z_i - z_j|, which is 1 exactly when i and j have different labels
    D = pairwise_distances(z.reshape(-1, 1))
    return (A*D).sum()

print(f"good labels cut = {cut(A, z = y)}")
print(f"bad labels cut = {cut(A, z = y_bad)}")
good labels cut = 14.0
bad labels cut = 2950.0
So, it might seem as though a good approach would be to minimize the cut value of a label vector. Unfortunately, cut minimization has a fatal flaw: the best cut is always the one that assigns all nodes to the same label! This clustering always achieves a cut score of 0. So, we need to try something different.
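As a quick sanity check of that fatal flaw (this check is not part of the original notes), the labeling that puts every node in cluster 0 indeed cuts no edges:

# sanity check (not in the original notes): the all-zeros labeling achieves a cut of 0
print(cut(A, np.zeros(n)))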
Normalized Cut
To define a better objective, we need a little notation. The volume of cluster \(j\) in matrix \(\mathbf{A}\) with clustering vector \(\mathbf{z}\) is defined as
\[ \mathrm{vol}_{j}\left(\mathbf{A},\mathbf{z}\right) = \sum_{i = 1}^n \sum_{i' = 1}^n \mathbb{1}\left[ z_i = j \right] a_{ii'} \]
Heuristically, \(\mathrm{vol}_{j}\left(\mathbf{A},\mathbf{z}\right)\) is the number of edges that have one node in cluster \(j\). The normalized cut objective function is
\[ f(\mathbf{z}, \mathbf{A}) = \mathrm{cut}\left(\mathbf{A},\mathbf{z}\right)\left(\frac{1}{\mathrm{vol}_{0}\left(\mathbf{A},\mathbf{z}\right)} + \frac{1}{\mathrm{vol}_{1}\left(\mathbf{A},\mathbf{z}\right)}\right) \]
The normcut of our preferred clustering is still much better than that of the random one:
def vol(j, A, z):
    return A[z == j,:].sum()

def normcut(A, z):
    return cut(A, z) * (1/vol(0, A, z) + 1/vol(1, A, z))

print(f"good labels normcut = {normcut(A, z = y)}")
print(f"bad labels normcut = {normcut(A, z = y_bad)}")
good labels normcut = 0.009632954990428188
bad labels normcut = 2.036335377012906
Spectral Approximation
Great! How do we choose \(\mathbf{z}\) to minimize the normcut?

Whoops! It turns out that minimizing the normcut exactly is computationally intractable, so we need to do so approximately.
Here is the trick. Suppose that we have a binary clustering vector \(\mathbf{z}\). We’re going to associate \(\mathbf{z}\) with a modified clustering vector \(\tilde{\mathbf{z}}\) with entries
\[ \tilde{z}_i = \begin{cases} \phantom{-}\frac{1}{\mathrm{vol}_{0}\left(\mathbf{A},\mathbf{z}\right)} & \quad z_i = 0 \\ -\frac{1}{\mathrm{vol}_{1}\left(\mathbf{A},\mathbf{z}\right)} & \quad z_i = 1\;. \end{cases} \]
A direct mathematical calculation then shows that we can write the normcut objective as \(f(\mathbf{z}) = \tilde{f}({\tilde{\mathbf{z}}})\), where
\[ \tilde{f}({\tilde{\mathbf{z}}}) = \frac{\tilde{\mathbf{z}}^T(\mathbf{D}- \mathbf{A})\tilde{\mathbf{z}}}{\tilde{\mathbf{z}}^T\mathbf{D}\tilde{\mathbf{z}}}\;, \tag{2}\]
where
\[ \mathbf{D}= \left[\begin{matrix} \sum_{i = 1}^n a_{i1} & & & \\ & \sum_{i = 1}^n a_{i2} & & \\ & & \ddots & \\ & & & \sum_{i = 1}^n a_{in} \end{matrix}\right]\;. \]
Additionally, we have \[ \tilde{\mathbf{z}}^T\mathbf{D}\mathbf{1}= 0\;, \tag{3}\] where \(\mathbf{1}\) is the vector containing all 1s. Heuristically, Equation 3 says that cluster 0 and cluster 1 contain roughly the same number of edges within them.
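As a quick numeric sanity check (not part of the original notes), we can build \(\tilde{\mathbf{z}}\) from the good labels y using the vol function defined above and verify Equation 3 directly:

# numeric check (not in the original notes) of Equation 3 for the labels y
d  = A.sum(axis = 1)                                        # node degrees: the diagonal of D
zt = np.where(y == 0, 1 / vol(0, A, y), -1 / vol(1, A, y))  # the modified clustering vector
print(zt @ d)                                               # ztilde^T D 1: numerically zero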
At this stage we do a cheat: we forget about the requirement that \(\tilde{\mathbf{z}}\) have the form we stated above. Instead, we simply solve the optimization problem
\[ \begin{aligned} \tilde{\mathbf{z}} =& \mathop{\mathrm{arg\,min}}_{\tilde{\mathbf{z}}} \frac{\tilde{\mathbf{z}}^T(\mathbf{D}- \mathbf{A})\tilde{\mathbf{z}}}{\tilde{\mathbf{z}}^T\mathbf{D}\tilde{\mathbf{z}}} \\ & \text{such that }\tilde{\mathbf{z}}^T\mathbf{D}\mathbf{1}= 0 \end{aligned} \]
A beautiful theorem from linear algebra states that there is an explicit solution to this problem: \(\tilde{\mathbf{z}}\) should be the eigenvector associated with the second-smallest eigenvalue of the matrix \(\mathbf{L}= \mathbf{D}^{-1}[\mathbf{D}- \mathbf{A}]\). This matrix \(\mathbf{L}\) is called the normalized Laplacian, and it is the reason that this clustering algorithm is called Laplacian spectral clustering.
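The eigenvector computation used below is imported from a hidden module. As an illustration only, here is one possible sketch of such a function, based on the recipe above; this is an assumption, not the actual hidden.spectral code:

# possible sketch (not the actual hidden.spectral implementation) of the eigenvector step
def second_laplacian_eigenvector_sketch(A):
    # degree vector and normalized Laplacian L = D^{-1}(D - A)
    d = A.sum(axis = 1)
    L = np.diag(1 / d) @ (np.diag(d) - A)

    # eigenvector associated with the second-smallest eigenvalue of L
    eigvals, eigvecs = np.linalg.eig(L)
    order = np.argsort(eigvals.real)
    return eigvecs[:, order[1]].real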
from hidden.spectral import second_laplacian_eigenvector
fig, ax = plt.subplots(figsize = (4, 4))
z_ = second_laplacian_eigenvector(A)
plot_graph(X, A, z = z_, show_edge_cuts = False, ax = ax)
Darker shades correspond to nodes on which the eigenvector is negative, while lighter shades correspond to nodes on which the eigenvector is positive. We can get a final set of cluster labels just by assigning nodes to the cluster depending on the sign of the estimated \(\tilde{\mathbf{z}}\):
z = z_ > 0

fig, ax = plt.subplots(figsize = (4, 4))
plot_graph(X, A, z, show_edge_cuts = True, ax = ax)
Our method of distinguishing the two rings using Laplacian spectral clustering seems to have worked! Let’s try with another data set:
from sklearn.datasets import make_moons
X, z = make_moons(n_samples=100, random_state=1, noise = .1)

fig, ax = plt.subplots(figsize = (4, 4))
a = ax.scatter(X[:, 0], X[:, 1])
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
Now we’ll use a complete implementation of Laplacian spectral clustering, which you can write yourself in an optional blog post.
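For orientation, here is one possible sketch of what such an implementation might look like, stringing together the steps above and reusing second_laplacian_eigenvector_sketch; this is an assumption for illustration, not the actual hidden.spectral code:

# possible sketch of a spectral clustering function (not the actual hidden.spectral code)
from sklearn.neighbors import NearestNeighbors

def spectral_cluster_sketch(X, n_neighbors):
    # build and symmetrize the k-nearest-neighbors graph
    A = NearestNeighbors(n_neighbors = n_neighbors).fit(X).kneighbors_graph().toarray()
    A = A + A.T
    A[A > 1] = 1

    # second eigenvector of the normalized Laplacian, thresholded by sign
    z_ = second_laplacian_eigenvector_sketch(A)
    return (z_ > 0).astype(int)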
from hidden.spectral import spectral_cluster
fig, ax = plt.subplots(figsize = (4, 4))
z = spectral_cluster(X, n_neighbors = 6)
a = ax.scatter(X[:, 0], X[:, 1], c = z, cmap = plt.cm.cividis)
a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2")
Looks pretty good! Note, however, that we chose a pretty specific value of n_neighbors, the number of neighbors to use when forming the nearest-neighbors graph. This is a hyperparameter that needs to be tuned in some way. Unfortunately, cross-validation isn’t really an option here, as we don’t have a loss function or training data to use for validation purposes.

The performance of Laplacian spectral clustering can depend pretty strongly on this parameter. In this data set, small values can lead to oddly fragmented clusters, while larger values lead to results that don’t look too different from what we might expect from k-means:
fig, axarr = plt.subplots(2, 3, figsize = (6, 4))

i = 2
for ax in axarr.ravel():
    z = spectral_cluster(X, n_neighbors = i)
    a = ax.scatter(X[:, 0], X[:, 1], c = z, cmap = plt.cm.cividis)
    a = ax.set(xlabel = "Feature 1", ylabel = "Feature 2", title = f"{i} neighbors")
    i += 1

plt.tight_layout()
As usual, there are ways to address these limitations, but these are beyond our scope for today.
© Phil Chodrow, 2023