More Classifiers and More Labels

Author

Phil Chodrow

In this set of notes, we’ll introduce a few new classifiers at a high level, including classifiers that go beyond the framework of convex linear models.

Recap

So far, we’ve focused on the framework of empirical risk minimization for convex linear models that address the binary classification task. Today, we’re going to (a) look at classification beyond binary labels and (b) briefly discuss some examples of classification models that are neither convex nor linear.

Recall that in our setting of convex linear models for binary classification, we consider the problem of minimizing the following function:

\[ L(\mathbf{w}) = \sum_{i = 1}^n \ell(\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle, y_i) \]

Here,

  • \(\mathbf{X}\in \mathbb{R}^{n\times p}\) is the feature matrix. There are \(n\) distinct observations, encoded as rows. Each of the \(p\) columns corresponds to a feature: something about each observation that we can measure or infer. Each observation is written \(\mathbf{x}_1, \mathbf{x}_2,\ldots\). \[ \mathbf{X}= \left[\begin{matrix} - & \mathbf{x}_1 & - \\ - & \mathbf{x}_2 & - \\ \vdots & \vdots & \vdots \\ - & \mathbf{x}_{n} & - \end{matrix}\right] \]
  • \(\mathbf{y}\in \mathbb{R}^{n}\) is the target vector. The target vector gives a label, value, or outcome for each observation.
  • \(\phi\) is a feature map and \(\ell\) is a convex per-observation loss function.

We’ve studied where this framework comes from and how to solve the empirical risk minimization problem \[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} L(\mathbf{w})\;. \] using gradient descent, in which we perform the iteration \[ \hat{\mathbf{w}}^{(t+1)} \gets \hat{\mathbf{w}}^{(t)} - \alpha \nabla L(\hat{\mathbf{w}}^{(t)}) \] until convergence. Assuming that our per-observation loss function is convex (as it is, for example, in logistic regression) and that the learning rate \(\alpha\) is suitably small, gradient descent will converge to the globally optimal \(\hat{\mathbf{w}}\) (although it might do so slowly).
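
As a quick reminder of what that looks like in code, here is a minimal NumPy sketch of the iteration; the function grad_L (the gradient of the empirical risk) and the starting point are placeholders for illustration:

import numpy as np

def gradient_descent(grad_L, w, alpha = 0.01, max_iter = 1000, tol = 1e-8):
    # repeat the update w <- w - alpha * grad_L(w) until the step becomes tiny
    for _ in range(max_iter):
        step = alpha * grad_L(w)
        w = w - step
        if np.linalg.norm(step) < tol:
            break
    return w

# toy example: minimize L(w) = ||w||^2, whose gradient is 2w
w_hat = gradient_descent(lambda w: 2*w, np.array([1.0, -2.0]))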

Modifications for the Multiclass Setting

Multiple Class Labels

So far, we’ve treated binary classification, especially in the setting where the labels \(y \in \{0,1\}\). We’d like to do multiclass classification, where, for example, \(y \in \{0, 1, 2\}\). This is the setting that you encounter, for example, in the blog post on penguin classification. The transition from binary classification to multiple class labels is not too complex if we allow ourselves to think of the target label \(y\) as encoding a target vector \(\tilde{\mathbf{y}}\) with a 1 in the \(y\)th entry and zeros everywhere else. This is often called one-hot encoding. For example, if the number of possible classes is \(k = 3\) and \(y = 1\), then \(\tilde{\mathbf{y}} = (0, 1, 0)\). For this to work, we have to make a few other modifications as well, described in the next two subsections.
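
As a quick sketch, here is one way to convert integer labels into one-hot target vectors with NumPy (the helper one_hot below is just for illustration and is not part of any library we use):

import numpy as np

def one_hot(y, k):
    # return an (n, k) matrix whose ith row is the one-hot encoding of y[i]
    y_tilde = np.zeros((len(y), k))
    y_tilde[np.arange(len(y)), y] = 1
    return y_tilde

one_hot(np.array([0, 1, 2, 1]), 3)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]])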

Prediction Vectors

Our prediction model \(f(\mathbf{x})\) can’t just spit out a real number any more – it needs to spit out something that we can compare with \(\tilde{\mathbf{y}}\). So, things like \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle\) don’t work anymore! We usually assume that \(f:\mathbb{R}^p \rightarrow \mathbb{R}^k\), that is, \(\hat{\mathbf{y}} = f(\mathbf{x})\) is a vector of the same length as \(\tilde{\mathbf{y}}\). As one important example of this, we might assume that \[ f(\mathbf{x}) = \mathbf{W}\mathbf{x}\;, \] where now \(\mathbf{W}\in \mathbb{R}^{k \times p}\) is a matrix of weights.

This is a direct generalization of our previous setting: if \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle\), then we can think of \(\mathbf{w}\) as being a \(p\times 1\) matrix.
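
For concreteness, here is a tiny sketch (with made-up numbers) of how a weight matrix turns a single feature vector into a vector of \(k\) scores, one per class:

import numpy as np

W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.5],
              [-1.0,  1.0]])     # shape (k, p) with k = 3 classes, p = 2 features
x = np.array([2.0, 1.0])         # one observation, shape (p,)

y_hat = W @ x                    # shape (k,): one score per class
# array([ 1. ,  1.5, -1. ])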

Loss Function

We also need to modify our loss function so that we can compute things like \[ \ell(\hat{\mathbf{y}}, \tilde{\mathbf{y}}) \] when both \(\hat{\mathbf{y}}\) and \(\tilde{\mathbf{y}}\) are vectors. One common way we do this is via the categorical cross-entropy. First, define the softmax function \(\boldsymbol{\sigma}:\mathbb{R}^k\rightarrow \mathbb{R}^k\) by the formula

\[ \boldsymbol{\sigma}(\hat{\mathbf{y}})_h = \frac{e^{\hat{y}_h}}{\sum_{h' = 1}^k e^{\hat{y}_{h'}}}\;. \]

The vector \(\boldsymbol{\sigma}(\hat{\mathbf{y}})\) is a probability vector: all its entries are nonnegative and sum to 1. For convenience, write \(\hat{\mathbf{p}} = \boldsymbol{\sigma}(\hat{\mathbf{y}})\). Then the categorical cross-entropy is

\[ \ell(\hat{\mathbf{y}}, \tilde{\mathbf{y}}) = -\sum_{h = 1}^k \tilde{y}_h \log \hat{p}_h\;. \tag{1}\]

The categorical cross-entropy is a generalization of the logistic loss: when \(k = 2\), it reduces to the binary cross-entropy loss used in logistic regression.
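
Because \(\tilde{\mathbf{y}}\) is one-hot, the sum in Equation 1 picks out a single term: the loss is just \(-\log \hat{p}_y\), the negative log of the probability assigned to the true class. Here is a minimal NumPy sketch of both the softmax and the categorical cross-entropy (the function names are ours, chosen for illustration):

import numpy as np

def softmax(y_hat):
    # exponentiate each entry and normalize so that the entries sum to 1
    e = np.exp(y_hat - y_hat.max())   # subtracting the max improves numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y_tilde):
    # categorical cross-entropy between a score vector and a one-hot target
    return -np.sum(y_tilde * np.log(softmax(y_hat)))

p_hat = softmax(np.array([1.0, 1.5, -1.0]))                       # roughly array([0.36, 0.59, 0.05])
cross_entropy(np.array([1.0, 1.5, -1.0]), np.array([0, 1, 0]))    # roughly 0.52 = -log(0.59)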

Multiclass Empirical Risk

We can now write the general empirical risk (not assuming linearity or convexity) as

\[ \sum_{i = 1}^n \ell(f(\mathbf{x}_i), \tilde{\mathbf{y}}_i)\;. \]

As usual, we’d like to find a prediction rule \(f\) that makes the empirical risk small, although we need to be aware of possible issues related to overfitting.
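
In code, the empirical risk is just the per-observation loss summed over the training set. A schematic sketch, where the prediction rule f and the per-observation loss are placeholders to be supplied:

def empirical_risk(f, loss, X, Y_tilde):
    # sum the loss of the prediction f(x_i) against the one-hot target for each observation
    return sum(loss(f(x), y_tilde) for x, y_tilde in zip(X, Y_tilde))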

A Quick Tour of Classifiers

Multinomial Logistic Regression

In multinomial logistic regression, \(f(\mathbf{x}_i) = \mathbf{W}\mathbf{x}_i\) and the loss function is the categorical cross-entropy from Equation 1. An important feature of multinomial logistic regression is that it has linear decision boundaries.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train["Species"])

species = [s.split()[0] for s in le.classes_]

def prepare_data(df):
  # drop identifier and metadata columns that we won't use as features
  df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
  # remove rows with an ambiguous sex entry or missing values
  df = df[df["Sex"] != "."]
  df = df.dropna()
  # encode the species label as an integer target
  y = le.transform(df["Species"])
  df = df.drop(["Species"], axis = 1)
  # one-hot encode the remaining categorical columns
  df = pd.get_dummies(df)
  return df, y

X_train, y_train = prepare_data(train)
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

cols = ["Culmen Length (mm)", "Culmen Depth (mm)"]

def training_decision_regions(model, cols, **kwargs):
    # fit the given model class on two feature columns and plot its decision regions
    m = model(**kwargs)
    m.fit(np.array(X_train[cols]), y_train)
    plot_decision_regions(np.array(X_train[cols]), y_train, clf = m)
    ax = plt.gca()
    ax.set(xlabel = cols[0], 
           ylabel = cols[1], 
           title = f"Training accuracy = {m.score(np.array(X_train[cols]), y_train).round(2)}")

    # relabel the legend with species names rather than integer codes
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, 
              species, 
              framealpha=0.3, 
              scatterpoints=1)
training_decision_regions(LogisticRegression, cols)

If we fit a logistic regression model directly, rather than through our plotting function, we can see how its predictions work:

LR = LogisticRegression()
LR.fit(X_train[cols], y_train)
LogisticRegression()

Now that we’ve fit the model, we can inspect the weight matrix:

w_hat = LR.coef_
w_hat
array([[-1.02133384,  2.03824239],
       [ 0.21649256,  0.51465801],
       [ 0.80484128, -2.5529004 ]])

This weight matrix, together with the intercept vector stored in LR.intercept_, determines the matrix of prediction scores \(\hat{\mathbf{Y}}\).

y_hat = np.array(X_train[cols])@w_hat.T + LR.intercept_ # include the intercept so these are the model's actual scores

The built-in method LR.predict_proba computes these scores and passes them through the softmax function. The advantage of this is that we can interpret each entry as a probability of class membership:

P = LR.predict_proba(X_train[cols])

P[0,:] # guess is category 2, Gentoo
array([4.32383911e-06, 5.37250378e-03, 9.94623172e-01])
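
As a sanity check, we can recover these probabilities ourselves by applying the softmax row-wise to the raw scores y_hat computed above; for the default multinomial setting, this should agree with predict_proba:

scores = y_hat - y_hat.max(axis = 1, keepdims = True)  # subtract the row max for numerical stability
P_manual = np.exp(scores) / np.exp(scores).sum(axis = 1, keepdims = True)
np.allclose(P_manual, P)  # expected to be True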

Here’s a heatmap of the predicted class probabilities for the first 20 individuals. Brighter yellow means greater predicted probability of belonging to the corresponding class.

p = plt.imshow(P[0:20,:].T)

Almost all of the individuals are confidently assigned to a single class, while the model is less certain about the membership of the penguin with index 10.

Support Vector Machine

The support vector machine for binary classification is a convex linear model in which we use the so-called hinge loss; here it is convenient to code the binary labels as \(y \in \{-1, 1\}\). In the notation from our previous lectures, it can be written like this:

\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \left[\sum_{i = 1}^n \max \{1 - y_i \langle \mathbf{w}, \mathbf{x}_i \rangle, 0\} + \frac{1}{2C}\sum_{j = 1}^p w_j^2\right]\;. \]
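
As a sketch, here is how the objective above might be computed in NumPy (the function name is ours, and the labels are assumed to be coded as \(\pm 1\) as above):

import numpy as np

def svm_objective(w, X, y, C = 1.0):
    # hinge loss summed over observations, plus the regularization term
    margins = 1 - y * (X @ w)
    return np.maximum(margins, 0).sum() + (1 / (2 * C)) * np.sum(w**2)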

Mathematically, the support vector machine is an exceptionally beautiful algorithm, primarily because it admits a “kernel trick.” The kernel trick allows us to use infinite-dimensional nonlinear features for prediction, which can significantly enhance the expressive power of our models. To my knowledge, unfortunately, the support vector machine doesn’t extend to multiclass classification very gracefully. What scikit-learn does is split the problem into a collection of binary subproblems (training one binary classifier for each pair of classes) and combine their results to obtain the final prediction. Here’s an example:

For more on the kernel trick, see Hardt and Recht, p. 58-62.

from sklearn.svm import SVC
training_decision_regions(SVC, cols, kernel = "rbf", gamma = .1)

Here, the rbf (radial basis function) kernel is one of several kernels we could have chosen. The parameter gamma controls how wiggly the decision boundary is allowed to be: larger values of gamma permit wigglier boundaries.

from sklearn.svm import SVC
training_decision_regions(SVC, cols, kernel = "rbf", gamma = 10)

Cross-validation or other tools should be used in order to determine a value of \(\gamma\) that has good expressive power while avoiding overfitting.
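
Here is one way such a search might look: a simple sketch using cross_val_score over a hand-picked grid of gamma values, reusing the training data from above (the grid itself is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for gamma in [0.01, 0.1, 1, 10, 100]:
    # mean 10-fold cross-validation accuracy for this value of gamma
    m = cross_val_score(SVC(kernel = "rbf", gamma = gamma), X_train[cols], y_train, cv = 10).mean()
    print(f"gamma = {gamma}: mean CV score = {m.round(3)}")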

Multilayer Perceptron

Logistic regression and the support vector machine both still fall within the convex linear model framework. Let’s now move beyond this framework for the first time. We’ll consider

  1. A new nonconvex linear model.
  2. A nonconvex nonlinear model.

We’ve already seen a nonconvex linear model: perceptron! To create a more useful one, let’s consider the following idea: we’re going to just stack logistic regressions on top of each other, like this: \[ \mathbf{Z}= \boldsymbol{\sigma}(\mathbf{X}\mathbf{W}) \]

That is, the matrix \(\mathbf{Z}\) is the result of computing the matrix product \(\mathbf{X}\mathbf{W}\) and then applying the softmax function row-wise. If \(\mathbf{W}\) is \(p\times \ell\), then \(\mathbf{X}\mathbf{W}\) is an \(n\times \ell\) matrix, as is \(\mathbf{Z}\). This is essentially multinomial logistic regression. Now, here’s the thing: what if we just used \(\mathbf{Z}\) as the input to another logistic regression? That is, we compute \[ \hat{\mathbf{Y}} = \boldsymbol{\sigma}(\mathbf{Z}\mathbf{W}')\;, \]

where \(\mathbf{W}'\) is a new matrix of weights and \(\hat{\mathbf{Y}}\) is our matrix of predictions that we will assess using the categorical cross-entropy or another such function. Then, the empirical risk minimization problem is \[ \hat{\mathbf{W}}, \hat{\mathbf{W}}' = \mathop{\mathrm{arg\,min}}_{\mathbf{W}, \mathbf{W}'} \sum_{i = 1}^n \ell(\boldsymbol{\sigma}(\boldsymbol{\sigma}(\mathbf{X}\mathbf{W})\mathbf{W}')_i, \tilde{\mathbf{y}}_i) \;. \]

This problem is no longer convex, but we can still try to optimize it with gradient descent.

We often call the computation of \(\mathbf{Z}\) a hidden layer because it is neither the feature matrix \(\mathbf{X}\) nor the target \(\tilde{\mathbf{y}}\). So, we have created a model with a single hidden layer. The idea of stacking together simple linear transformations with simple nonlinearities is the fundamental idea of modern deep learning.
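
To make the stacking concrete, here is a schematic NumPy forward pass for a model with one hidden layer; the sizes are made up and the weights are random rather than trained:

import numpy as np

def softmax_rows(A):
    # apply the softmax to each row of A
    e = np.exp(A - A.max(axis = 1, keepdims = True))
    return e / e.sum(axis = 1, keepdims = True)

n, p, ell, k = 5, 4, 10, 3            # observations, features, hidden units, classes
rng = np.random.default_rng(1234)

X = rng.normal(size = (n, p))         # feature matrix
W = rng.normal(size = (p, ell))       # first layer of weights
W_prime = rng.normal(size = (ell, k)) # second layer of weights

Z = softmax_rows(X @ W)               # hidden layer, shape (n, ell)
Y_hat = softmax_rows(Z @ W_prime)     # prediction matrix, shape (n, k)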

scikit-learn implements models like this under the heading of “multilayer perceptron” (the name is mostly historical). We can create a multilayer perceptron like this:

from sklearn.neural_network import MLPClassifier

training_decision_regions(MLPClassifier, cols, activation = "logistic", hidden_layer_sizes = (100, 100, 100))

We observe apparently linear decision boundaries in the data set this time, although in principle the model could also have generated nonlinear boundaries.

Decision Tree Classifiers

Decision tree classifiers still do empirical risk minimization, but they are both nonlinear and nonconvex. The best way to see what a decision tree classifier does is to train one and visualize it:

from sklearn.tree import DecisionTreeClassifier, plot_tree

DTC = DecisionTreeClassifier(max_depth = 3)
DTC.fit(X_train[cols], y_train)

p = plot_tree(DTC, feature_names = cols, filled = True, class_names = species)

We observe that the decision tree works by making a sequence of decisions that sort the data into progressively finer buckets. You can implement a decision tree as nothing more than a sequence of nested if-else statements (see the sketch after the next plot), although the algorithms used to actually train them are trickier. The decision regions for decision trees look “boxy,” composed of vertical and horizontal segments:

training_decision_regions(DecisionTreeClassifier, cols, max_depth = 3)
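
Here is what such a nested if-else sketch might look like for a shallow tree on the two culmen measurements; the thresholds below are made up for illustration and are not taken from the fitted tree above:

def toy_tree_classifier(culmen_length, culmen_depth):
    # made-up thresholds, purely to illustrate the nested if-else structure of a decision tree
    if culmen_length < 43:
        return "Adelie"
    else:
        if culmen_depth < 17:
            return "Gentoo"
        else:
            return "Chinstrap"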

Decision trees are very flexible models, but it is easy for them to overfit if the depth is too high:

training_decision_regions(DecisionTreeClassifier, cols, max_depth = 10)

For this reason, it is common to choose the depth through cross-validation:

from sklearn.model_selection import cross_val_score
np.random.seed(12345)

fig, ax = plt.subplots(1)

for d in range(2, 10):
    T = DecisionTreeClassifier(max_depth = d)
    m = cross_val_score(T, X_train[cols], y_train, cv = 10).mean()
    ax.scatter(d, m, color = "black")
    # ax.scatter(d, T.score(X_test, y_test), color = "firebrick")

labs = ax.set(xlabel = "Complexity (depth)", ylabel = "Performance (score)")

It looks like a depth of roughly 6 might be about right for this data set:

training_decision_regions(DecisionTreeClassifier, cols, max_depth = 6)

Random Forest

A random forest is essentially a collection of many decision trees, each trained on a random subset of the data (and, typically, random subsets of the features). Random forest classifiers have some very good properties that help them be fairly resistant to overfitting; they usually work pretty well “out of the box.”

from sklearn.ensemble import RandomForestClassifier
training_decision_regions(RandomForestClassifier, cols)

A Wide World of Classifiers

There are many other classification algorithms. Which algorithm to use in a specific case depends on things like:

  1. How much computational power do I have for the training stage?
  2. How much computational power do I have each time I make a prediction?
  3. Is the mathematical structure of the classifier well-aligned to my data?

For relatively small data sets, it’s often possible to simply use cross-validation scores or similar metrics in order to choose between competing classifiers.



© Phil Chodrow, 2023