import numpy as np
from matplotlib import pyplot as plt
= -0.5
w0 = 0.7
w1
= 100
n = np.random.rand(n, 1)
x = w1*x + w0 + 0.1*np.random.randn(n, 1)
y
plt.scatter(x, y)= plt.gca().set(xlabel = "Feature (x)", ylabel = "Target (y)") labels
Least-Squares Linear Regression
$$
So far in this course, we’ve focused exclusively on classification tasks: how to predict a categorical label for each data point. The other important task we need to consider is regression, in which we predict a real number for each data point based on its features. Here’s the stereotypical example:
Looking at this data, we can see an apparent linear trend that we would like to use in order to make prediction on new data points.
Mathematical Formulation
We’re going to focus on least-squares linear regression. The nice thing about least-squares linear regression is that it falls perfectly into our framework of convex linear models. In least-squares linear regression, we still make predictions of the form \(\hat{y}_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\), since these are exactly linear predictions! The loss function is \(\ell(\hat{y}, y) = (\hat{y} - y)^2\), the squared error, which is convex. The empirical risk minimization problem is
\[ \begin{aligned} \hat{\mathbf{w}} &= \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w}) \\ &= \sum_{i = 1}^n \ell(\hat{y}_i, y_i) \\ &= \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \sum_{i = 1}^n \left(\langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right)^2\;. \end{aligned} \] It’s useful to write this in a more compact way using matrix-vector notation: the loss function \(L(\mathbf{w})\) can be written
\[ L(\mathbf{w}) = \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2\;. \]
So, we want to solve the problem
\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w}) = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2\;. \tag{1}\]
Solution Methods
There are a lot of ways to solve Equation 1. Let’s start by taking the gradient with respect to \(\hat{\mathbf{w}}\). Using the multivariate chain rule, this is
\[ \nabla L(\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}- \mathbf{y})\;. \tag{2}\]
One way to approach the linear regression problem is with gradient descent: repeat the iteration
\[ \mathbf{w}^{(t+1)} \gets \mathbf{w}^{(t)} - 2\alpha \mathbf{X}^T(\mathbf{X}\mathbf{w}^{(t)} - \mathbf{y}) \]
to convergence. As it turns out, there’s also an explicit formula involving a matrix inversion that we can obtain by using the condition \(\nabla L(\mathbf{w}) = \mathbf{0}\) which must hold at the minimum. Plugging in our expression for \(L(\mathbf{w})\), we get
\[ \mathbf{0}= \mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y})\;. \]
To start solving for \(\hat{\mathbf{w}}\), we can move \(\mathbf{X}^T\mathbf{y}\) to the other side:
\[ \mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}\;. \]
Now, provided that the matrix \(\mathbf{X}^T\mathbf{X}\) is of full rank, we can multiply both sides by \((\mathbf{X}^T\mathbf{X})^{-1}\) to obtain \[ \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\;, \tag{3}\] which is an explicit formula for \(\hat{\mathbf{w}}\).
Let’s see if we can use this to compute predictions for our fake data above. In order for this formula to work, we need to ensure that \(\mathbf{X}\) is padded with a vector of ones.
def pad(X):
return np.append(X, np.ones((X.shape[0], 1)), 1)
= pad(x) X
Now we can use the formula:
= np.linalg.inv(X.T@X)@X.T@y w_hat
Let’s test this out on our fake data:
plt.scatter(x, y)
= np.linspace(0, 1, 101)[:,np.newaxis]
x_fake = pad(x_fake)
X_fake @w_hat, color = "black")
plt.plot(x_fake, X_fake= plt.gca().set(xlabel = "Feature (x)", ylabel = "Target (y)") labels
Not bad!
Scoring Linear Models
A good, simple way to score linear models is by using the loss function directly: smaller values are better! In scikit-learn
, regression models are instead scored using an affine transformation of the loss function called the coefficient of determination. The coefficient of determination is 1
when the model fits the data perfectly with no errors. It can be arbitrarily negative (e.g. -8.7
) if the model fits the data very poorly.
For a quick illustration, here’s the scikit-learn
implementation of linear regression:
from sklearn.linear_model import LinearRegression
= LinearRegression()
LR
LR.fit(x, y)
# training score
LR.score(x, y)
0.8587136683818445
Let’s evaluate our model on similar, unseen data:
= 100
n = np.random.rand(n, 1)
x_val = w1*x_val + w0 + 0.1*np.random.randn(n, 1) y_val
LR.score(x_val, y_val)
0.8290675457730805
As usual, gaps between the training and validation scores suggest the possibility of overfitting, although further investigation is required to see whether improvement on validation data is possible.
Incorporating Features
Sometimes (most of the time), the patterns we are looking for in our data are not actually linear. For example:
= 2*np.pi*np.random.rand(n, 1)
x = np.sin(x) + 0.2*np.random.randn(n, 1)
y
plt.scatter(x, y)
Just like in the case of classification, we can use feature maps to learn nonlinear patterns. For example:
from sklearn.preprocessing import PolynomialFeatures
= PolynomialFeatures(2)
p = p.fit_transform(x)
PHI 5] PHI[:
array([[1.00000000e+00, 2.38502900e+00, 5.68836335e+00],
[1.00000000e+00, 3.26013588e+00, 1.06284859e+01],
[1.00000000e+00, 1.60992471e-01, 2.59185759e-02],
[1.00000000e+00, 5.25277132e-02, 2.75916065e-03],
[1.00000000e+00, 4.18447039e+00, 1.75097924e+01]])
The feature matrix \(\Phi\) now plays the role of \(\mathbf{X}\), and we can use the same formula as earlier:
= np.linalg.inv(PHI.T@PHI)@PHI.T@y
w_hat w_hat
array([[ 0.93540932],
[-0.30331504],
[ 0.00414563]])
The predictions are
= PHI@w_hat y_hat
Let’s visualize:
plt.scatter(x, y)
= np.linspace(0, 2*np.pi, 1001)[:,np.newaxis]
x_lin = p.fit_transform(x_lin)
PHI_lin = PHI_lin@w_hat
y_trend
= "black") plt.plot(x_lin, y_trend, color
Hmmm, this is not a very impressive fit. Let’s wrap this process in a function and do a few experiments.
def poly_viz(deg, ax):
= PolynomialFeatures(deg)
p = p.fit_transform(x)
PHI = np.linalg.inv(PHI.T@PHI)@PHI.T@y
w_hat = np.linspace(x.min(), x.max(), 1001)[:,np.newaxis]
x_lin = p.fit_transform(x_lin)
PHI_lin = PHI_lin@w_hat
y_trend
ax.scatter(x, y)= "black") ax.plot(x_lin, y_trend, color
= plt.subplots(2, 5, figsize = (8, 4))
fig, axarr
for deg, ax in zip(range(1, 11), axarr.ravel()):
poly_viz(deg, ax)set(title = f"degree = {deg}")
ax.
plt.tight_layout()
As in classification, the use of polynomial features makes us susceptible to overfitting, and validation or cross-validation should be used in order to select a good degree.
Kernel Regression
Kernel methods offer a theoretically-grounded approach for dealing with nonlinearity in both classification and regression. In kernel methods, the feature vector corresponding to each data point \(\mathbf{x}_i\) is actually in terms of all the other points:
\[ \phi(\mathbf{x}_i) = \left(\begin{matrix} k(\mathbf{x}_i, \mathbf{x}_1) \\ k(\mathbf{x}_i, \mathbf{x}_2) \\ \cdots \\ k(\mathbf{x}_i, \mathbf{x}_n) \\ \end{matrix}\right) \]
Here, \(k:\mathbb{R}^{p} \times \mathbb{R}^{p} \rightarrow \mathbb{R}\) is a special function called a kernel function. Usually the kernel function is a measure of how similar two points are. A very common and useful kernel is the radial basis function (RBF) kernel
\[ k(\mathbf{x}_1, \mathbf{x}_2) = e^{-\gamma \lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert^2}\;. \]
This function is largest when \(\mathbf{x}_1 = \mathbf{x}_2\), and decreases as these two vectors become farther apart. The idea of kernel methods is that the prediction can be expressed as a weighted sum of the target values at nearby points.
Kernel methods have extremely beautiful mathematics behind them and are fun to implement, but for today we can just show the implementation in scikit-learn
:
from sklearn.kernel_ridge import KernelRidge
= KernelRidge(kernel = "rbf", gamma = 1)
KR
KR.fit(x, y)
= np.linspace(x.min(), x.max(), 1001)[:,np.newaxis]
x_lin = KR.predict(x_lin)
y_lin
plt.scatter(x, y)= "black") plt.plot(x_lin, y_lin, color
Different values of gamma
will result in more or less “wiggly” fits. gamma
should be tuned using validation or cross validation in order to protect against overfitting.
Activity
The following code produces a random feature matrix \(\mathbf{X}\) and weight vector \(\mathbf{w}\):
= np.random.rand(10, 3)
X = pad(X)
X = np.random.rand(X.shape[1])
w
= X@w + np.random.randn(X.shape[0]) y
- Implement a
predict
function that computesy_hat
fromX
andw
(hint: don’t overthink it). - Implement a
score
function that acceptsX
,y
, andw
as arguments and computes the coefficient of determination of the predictions. The coefficient of determination is \[ c = 1 - \frac{\sum_{i = 1}^n (\hat{y}_i - y_i)^2}{\sum_{i = 1}^n (\bar{y} - y_i)^2}\;, \] where \(\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i\).
You can modify these functions and use them in the blog post on linear regression.
© Phil Chodrow, 2023