Introduction to Deep Learning
The Problem of Features
Let’s begin by recalling and slightly expanding the empirical risk minimization framework that we’ve developed throughout this course. In the simplest approach to empirical risk minimization, we began with a matrix of features \(\mathbf{X}\in \mathbb{R}^{n\times p}\) and a vector of targets \(\mathbf{y}\in \mathbb{R}^n\). We defined a linear predictor function \(\hat{y} = f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle\), which we interpreted as producing predictions of the value of \(y\). We then defined a loss function \(\ell: \mathbb{R}\times \mathbb{R}\rightarrow \mathbb{R}\) that told us the quality of the prediction \(\hat{y}\) by comparing it to the true target \(y\). Our learning problem was to find \(\mathbf{w}\) by minimizing the empirical risk: the mean (or sum) of the loss across all data points:
\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}\in \mathbb{R}^p} \sum_{i = 1}^n \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i)\;. \]
We solved this problem using gradient descent.
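As a quick, minimal sketch of this pipeline (the synthetic data, squared-error loss, step size, and iteration count below are illustrative choices, not part of the course code):

```python
import torch

torch.manual_seed(1234)
X = torch.randn(100, 3)                          # n = 100 points, p = 3 features
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(100)

w = torch.zeros(3, requires_grad = True)
alpha = 0.1                                      # step size

for _ in range(200):
    y_hat = X @ w                                # linear predictions <w, x_i>
    loss = ((y_hat - y)**2).mean()               # empirical risk with squared-error loss
    loss.backward()                              # gradient of the empirical risk
    with torch.no_grad():
        w -= alpha * w.grad                      # gradient descent step
        w.grad.zero_()
```

After a few hundred steps, `w` should be close to the coefficient vector used to generate the synthetic data.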
When we extended this setup to multiclass classification (e.g. three penguin species), we modified it slightly. We let \(\mathbf{Y}\in \mathbb{R}^{n\times \ell}\) be a matrix of targets in which each of the \(\ell\) columns represents a possible category: \(y_{ij} = 1\) means that observation \(i\) has label \(j\). We then needed to replace the prediction rule \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle\) with the vector-matrix product \(f(\mathbf{x}) = \mathbf{x}\mathbf{W}\), where \(\mathbf{W}\in \mathbb{R}^{p \times \ell}\). This produced a vector \(\hat{\mathbf{y}}\) which we could compare to the corresponding row \(\mathbf{y}_i\) of \(\mathbf{Y}\) using an appropriately modified loss function.
\[ \hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{W}\in \mathbb{R}^{p \times \ell}} \sum_{i = 1}^n \ell(\mathbf{x}_i\mathbf{W}, \mathbf{y}_i)\;. \]
However, we soon discovered a limitation: this method can only discover linear patterns in data, e.g. linear decision boundaries for classification or linear trends for regression. But most interesting patterns in data are nonlinear. We addressed this limitation using feature engineering: define a feature map \(\phi: \mathbb{R}^p \rightarrow \mathbb{R}^{q}\) and then solve the modified problem
\[ \hat{\mathbf{W}} = \mathop{\mathrm{arg\,min}}_{\mathbf{W}\in \mathbb{R}^{q\times \ell}} \sum_{i = 1}^n \ell(\phi(\mathbf{x}_i)\mathbf{W}, \mathbf{y}_i)\;. \]
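For example, with a single raw feature we might use a polynomial feature map, so that a model which is linear in the transformed features can describe a nonlinear trend in the original one. A minimal sketch (the degree and data here are illustrative):

```python
import torch

def phi(X, q = 5):
    # polynomial feature map: columns are X**0, X**1, ..., X**(q-1)
    return torch.cat([X**k for k in range(q)], dim = 1)

X = torch.linspace(-1, 1, 50).reshape(-1, 1)   # p = 1 original feature
features = phi(X)                              # shape (50, 5): we would now learn W on these
```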
The Network View
We can express the process of obtaining a prediction from logistic regression using a computational graph. Here’s an example of a computational graph for logistic regression in which we have two input features and wish to perform 3-way classification by outputting, for each input \(\mathbf{x}\), a vector \(\mathbf{p}\) of probabilities for each label:
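```mermaid
flowchart LR
  subgraph input
    direction TB
    x_1
    x_2
  end
  subgraph feature_map[features]
    f_1
    f_2
  end
  subgraph raw_predictions[raw predictions/logits]
    y_hat_1
    y_hat_2
    y_hat_3
  end
  subgraph label_preds[label probabilities]
    p_1
    p_2
    p_3
  end
  x_1 & x_2 --> x
  x --phi--> f_1 & f_2
  f_1 --w_1--> y_hat_1
  f_2 --w_1--> y_hat_1
  f_1 --w_2--> y_hat_2
  f_2 --w_2--> y_hat_2
  f_1 --w_3--> y_hat_3
  f_2 --w_3--> y_hat_3
  y_hat_1 & y_hat_2 & y_hat_3 --> softmax
  softmax --> p_1 & p_2 & p_3
```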
Back to Recap
We have a problem that we never addressed in a fully satisfactory way. On the one hand, in order to describe complex, nonlinear patterns in our data, we want a lot of features. However, the more features we engineer, the more likely we are to encounter overfitting.
One of the most common approaches to this problem around 15 years ago was to engineer many features and then regularize, which forces the entries of \(\mathbf{w}\) to remain small. This is a useful approach that is still used today, especially in statistics and econometrics. However, when our data points are very complex (audio files, images, text), it can still be very difficult to manually engineer the right set of features, even with regularization.
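For example, with ridge (\(\ell_2\)) regularization the problem becomes
\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}\in \mathbb{R}^p} \sum_{i = 1}^n \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i) + \lambda \lVert \mathbf{w} \rVert_2^2\;, \]
where larger values of the regularization strength \(\lambda\) push the entries of \(\mathbf{w}\) toward zero.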
The fundamental idea of deep learning is to take a different tack: instead of only learning \(\mathbf{w}\) in the empirical risk minimization problem, we are also going to learn the feature map \(\phi\). That is, we want to find
\[ \hat{\mathbf{W}}, \hat{\phi} = \mathop{\mathrm{arg\,min}}_{\mathbf{W}, \phi} \; \mathcal{L}(\phi(\mathbf{X})\mathbf{W}, \mathbf{Y})\;, \]
and our learned predictor is \(f(\mathbf{X}) = \hat{\phi}(\mathbf{X})\hat{\mathbf{W}}\).
The need to learn the feature map \(\phi\) as well as weights \(\mathbf{w}\) makes this problem much harder, both mathematically and computationally. A major enabler of the deep learning revolution has been the development of hardware that is up to the task (especially GPUs), as well as algorithms that make good use of this hardware.
Deep Penguin Classification
Let’s use our friends the penguins as a running example. When working with deep models, we often need to construct more complex data structures in order to feed our data to our models. I’m hiding this complexity for today; we’ll discuss it in more detail in a coming lecture.
```python
import torch
import pandas as pd
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# map species names to integer class labels
target_ix_dict = {
    "Adelie Penguin (Pygoscelis adeliae)" : 0,
    "Chinstrap penguin (Pygoscelis antarctica)" : 1,
    "Gentoo penguin (Pygoscelis papua)" : 2
}

class PenguinsDataset(Dataset):

    def __init__(self, train = True):
        if train:
            url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
        else:
            url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"

        # read the data, drop columns we won't use, and one-hot encode the categorical columns
        df = pd.read_csv(url)
        df = df.drop(["studyName", "Sample Number", "Individual ID", "Date Egg", "Comments", "Region"], axis = 1)
        df = df[df["Sex"] != "."]   # remove rows with a placeholder "." in the Sex column
        df = pd.get_dummies(df, columns = [
            "Island", "Stage", "Clutch Completion", "Sex"
        ])
        df = df.dropna()

        self.df = df

        # convert features to a float tensor and labels to one-hot vectors of length 3
        self.transform = lambda x: torch.tensor(x, dtype = torch.float32)
        self.target_ix = lambda x: target_ix_dict[x]
        self.target_transform = lambda x: torch.zeros(
            3, dtype=torch.float).scatter_(dim=0, index=torch.tensor(x), value=1)

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        features = self.df.drop(["Species"], axis = 1).iloc[idx,:]
        label = self.df.iloc[idx,:]["Species"]
        features = self.transform(features)

        label = self.target_ix(label)
        label = self.target_transform(label)

        features = features.to(device)
        label = label.to(device)

        return features, label
```
```python
# select a device for computation
# (note that this block falls back to the CPU even when Apple's MPS backend is available)
if torch.backends.mps.is_available():
    device = "cpu"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
```
Here are our data sets. `PenguinsDataset` is a custom class that I implemented above, while `DataLoader` is a utility from `torch` that automatically handles things like batching and randomization for gradient descent.
```python
train = PenguinsDataset()
train_dataloader = DataLoader(train, batch_size=10, shuffle=True)

val = PenguinsDataset(False)
val_dataloader = DataLoader(val, batch_size=10, shuffle=True)
```
Now we’re ready to define our model. To see how we do things in `torch`, let’s start by implementing logistic regression. First, we need to create a predictor model that performs the prediction step of logistic regression, which, as you’ll recall, is nothing more than matrix multiplication. The `nn.Linear` “layer” supplied by `torch` implements this:
```python
from torch import nn

class penguinLogistic(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear = nn.Sequential(
            nn.Linear(14, 3) # (number of features, number of class labels)
        )

    def forward(self, x):
        return self.linear(x)
```
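To convince ourselves that this model really is just matrix multiplication (plus a bias vector, which `nn.Linear` includes by default), we can compare its output against the explicit formula. The throwaway instance and random input below are purely illustrative:

```python
demo_model = penguinLogistic()
x = torch.randn(5, 14)                                # 5 fake "data points" with 14 features

W, b = demo_model.linear[0].weight, demo_model.linear[0].bias
print(torch.allclose(demo_model(x), x @ W.T + b))     # True: nn.Linear computes x W^T + b
```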
We can optimize this model using the cross-entropy loss and the Adam optimizer. As you’ll recall, logistic regression is nothing more than matrix multiplication plus the cross-entropy loss, so this is a logistic regression model!
```python
model = penguinLogistic().to(device)   # move the model to the same device as the data
loss_fn = nn.CrossEntropyLoss()

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```
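One detail worth knowing: `nn.CrossEntropyLoss` applies the softmax itself, so our model outputs raw logits rather than probabilities. With one-hot targets like ours, the loss for a single data point is \(-\sum_j y_j \log \mathrm{softmax}(\hat{\mathbf{y}})_j\), averaged over the batch. Here’s a quick, illustrative check of that claim (note that passing probability-style targets to `CrossEntropyLoss` requires a reasonably recent version of `torch`):

```python
logits  = torch.randn(4, 3)                              # fake raw model outputs
targets = torch.eye(3)[torch.tensor([0, 2, 1, 1])]       # fake one-hot labels

manual = -(targets * torch.log_softmax(logits, dim = 1)).sum(dim = 1).mean()
print(torch.allclose(loss_fn(logits, targets), manual))  # True
```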
Now we need to actually perform the optimization. The `torch` package gives us a lot of control over exactly how this happens, and we’ll go over the details in a future lecture.
```python
def training_loop(model, train_dataloader, val_dataloader, learning_rate, epochs):
    # note: uses the optimizer and loss_fn defined in the global scope above

    train_size = len(train_dataloader.dataset)
    val_size = len(val_dataloader.dataset)

    train_loss_history = []
    train_acc_history = []
    val_loss_history = []
    val_acc_history = []

    for t in range(epochs):

        train_loss = 0.0
        train_acc = 0.0
        val_loss = 0.0
        val_acc = 0.0

        # training pass
        for batch, (X, y) in enumerate(train_dataloader):
            optimizer.zero_grad()
            pred = model(X)
            fit = loss_fn(pred, y)
            train_loss += fit.item() / train_size

            # accuracy: compare the predicted class to the one-hot label's class
            train_acc += (pred.argmax(dim = 1) == y.argmax(dim = 1)).sum() / train_size

            # Backpropagation
            fit.backward()
            optimizer.step()

        train_loss_history += [train_loss]
        train_acc_history += [train_acc]

        # validation pass: no gradients needed
        for batch, (X, y) in enumerate(val_dataloader):
            with torch.no_grad():
                pred = model(X)
                fit = loss_fn(pred, y)
                val_loss += fit.item() / val_size
                val_acc += (pred.argmax(dim = 1) == y.argmax(dim = 1)).sum() / val_size

        val_loss_history += [val_loss]
        val_acc_history += [val_acc]

        if t % 50 == 0:
            print(f"epoch {t}: val_loss = {val_loss}, val_accuracy = {val_acc}")

    return train_loss_history, train_acc_history, val_loss_history, val_acc_history
```
```python
from matplotlib import pyplot as plt

def plot_histories(tlh, tah, vlh, vah):

    fig, axarr = plt.subplots(1, 2, figsize = (6, 3))

    axarr[0].plot(tlh, label = "train")
    axarr[0].plot(vlh, label = "validation")
    axarr[0].legend()
    axarr[0].set(xlabel = "epoch", title = "loss")
    axarr[0].semilogy()

    axarr[1].plot(tah, label = "train")
    axarr[1].plot(vah, label = "validation")
    axarr[1].legend()
    axarr[1].set(xlabel = "epoch", title = "accuracy")
```
Let’s go ahead and train!
```python
tlh, tah, vlh, vah = training_loop(model, train_dataloader, val_dataloader, 1e-4, 500)
plot_histories(tlh, tah, vlh, vah)
```
```
epoch 0: val_loss = 45.5021510404699, val_accuracy = 0.4852941632270813
epoch 50: val_loss = 0.609782618634841, val_accuracy = 0.07352941483259201
epoch 100: val_loss = 0.3989190424189848, val_accuracy = 0.1764705926179886
epoch 150: val_loss = 0.19054848131011515, val_accuracy = 0.44117650389671326
epoch 200: val_loss = 0.06688327999675975, val_accuracy = 0.7352941036224365
epoch 250: val_loss = 0.03516546855954562, val_accuracy = 0.9264706373214722
epoch 300: val_loss = 0.026237546082805183, val_accuracy = 0.9705882668495178
epoch 350: val_loss = 0.02203846328398761, val_accuracy = 0.9852941632270813
epoch 400: val_loss = 0.01984664469080813, val_accuracy = 1.0000001192092896
epoch 450: val_loss = 0.015965097419479313, val_accuracy = 0.9852941632270813
```
We can see that our model is doing much better than we would expect from random guessing on this problem, although it may not be competitive with many of the models that you implemented in your analysis of the penguins data set. Further training or tweaks to parameters like batch sizes and learning rates could potentially help improve performance here.
Let’s try adding a hidden layer, as in Equation 1. To do this, we need to add a nonlinearity \(\alpha\) and more `Linear` layers. The important points are:
- The first dimension of the first linear layer needs to match the number of features of the input.
- The final dimension of the last linear layer needs to match the number of possible labels.
- The final dimension of each linear layer needs to match the first dimension of the next layer.
These rules follow directly from the need to make all the matrix multiplications work out; an example with two hidden layers is sketched below.
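This sketch is purely illustrative (the widths 100 and 50 are arbitrary choices, not from the lecture); note how consecutive layer dimensions match:

```python
deeper_stack = nn.Sequential(
    nn.Linear(14, 100),  # first dimension matches the 14 input features
    nn.ReLU(),
    nn.Linear(100, 50),  # 100 matches the previous layer's final dimension
    nn.ReLU(),
    nn.Linear(50, 3)     # final dimension matches the 3 possible labels
)
```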
So, let’s try a model with a single layer of 100 hidden units:
```python
from torch import nn

class penguinNN(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(14, 100), # U_1
            nn.ReLU(),          # common choice of alpha these days
            nn.Linear(100, 3)   # W
        )

    def forward(self, x):
        return self.linear_relu_stack(x)

model = penguinNN().to(device)
```
We can optimize it using the same approach as before:
```python
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

tlh, tah, vlh, vah = training_loop(model, train_dataloader, val_dataloader, 1e-4, 500)
plot_histories(tlh, tah, vlh, vah)
```
```
epoch 0: val_loss = 10.191597097060258, val_accuracy = 0.3382353186607361
epoch 50: val_loss = 0.06825322088073282, val_accuracy = 0.7647059559822083
epoch 100: val_loss = 0.047590394668719345, val_accuracy = 0.867647111415863
epoch 150: val_loss = 0.053128435331232404, val_accuracy = 0.7058823704719543
epoch 200: val_loss = 0.03455903118147569, val_accuracy = 0.9264706373214722
epoch 250: val_loss = 0.0409484260222491, val_accuracy = 0.779411792755127
epoch 300: val_loss = 0.02161241366582758, val_accuracy = 0.9558823704719543
epoch 350: val_loss = 0.019268721122952068, val_accuracy = 0.9852941632270813
epoch 400: val_loss = 0.057865854571847355, val_accuracy = 0.6764705777168274
epoch 450: val_loss = 0.020513927235322842, val_accuracy = 0.9558823704719543
```
This model has also done well on this task, although it is still not quite at perfect accuracy. It’s also interesting to note that the model’s loss and accuracy both vary quite wildly during the training process. This is not uncommon, but may also point to a possible benefit of using a smaller step size.
It might appear that we haven’t really gained much from the massive costs of mathematical structure and computational architecture: we could have done better with just tools from `sklearn`, or even our own hand-built implementations! We’ll soon see the infrastructure working to our advantage when we get to problems involving large data sets with complex structure, like text and images.
© Phil Chodrow, 2023