Least-Squares Linear Regression

Author

Phil Chodrow

So far in this course, we’ve focused exclusively on classification tasks: how to predict a categorical label for each data point. The other important task we need to consider is regression, in which we predict a real number for each data point based on its features. Here’s the stereotypical example:

import numpy as np
from matplotlib import pyplot as plt

w0 = -0.5
w1 =  0.7

n = 100
x = np.random.rand(n, 1)
y = w1*x + w0 + 0.1*np.random.randn(n, 1)

plt.scatter(x, y)
labels = plt.gca().set(xlabel = "Feature (x)", ylabel = "Target (y)")

Looking at this data, we can see an apparent linear trend that we would like to use in order to make predictions on new data points.

Mathematical Formulation

We’re going to focus on least-squares linear regression. The nice thing about least-squares linear regression is that it falls perfectly into our framework of convex linear models. In least-squares linear regression, we still make predictions of the form \(\hat{y}_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\), since these are exactly linear predictions! The loss function is \(\ell(\hat{y}, y) = (\hat{y} - y)^2\), the squared error, which is convex. The empirical risk minimization problem is

\[ \begin{aligned} \hat{\mathbf{w}} &= \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w}) \\ &= \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; \sum_{i = 1}^n \ell(\hat{y}_i, y_i) \\ &= \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; \sum_{i = 1}^n \left(\langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \right)^2\;. \end{aligned} \] It’s useful to write this in a more compact way using matrix-vector notation: the loss function \(L(\mathbf{w})\) can be written

\[ L(\mathbf{w}) = \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2\;. \]

Reminder: \(\mathbf{X}\in \mathbb{R}^{n\times p}\), \(\mathbf{w}\in \mathbb{R}^{p}\), \(\mathbf{X}\mathbf{w}\in \mathbb{R}^n\), which is the same dimension as \(\mathbf{y}\).
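To make this concrete, here is a minimal sketch of how this loss could be computed with numpy; the function name squared_loss is purely illustrative and assumes the shapes described above.

def squared_loss(w, X, y):
    # ||Xw - y||_2^2: the sum of squared differences between predictions and targets
    # assumes compatible shapes, e.g. X is (n, p), w is (p, 1), y is (n, 1)
    return np.sum((X@w - y)**2)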

So, we want to solve the problem

\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w}) = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2\;. \tag{1}\]

Solution Methods

There are a lot of ways to solve Equation 1. Let’s start by taking the gradient with respect to \(\mathbf{w}\). Using the multivariate chain rule, this is

\[ \nabla L(\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}- \mathbf{y})\;. \tag{2}\]

One way to approach the linear regression problem is with gradient descent: until convergence, repeat the iteration

\[ \mathbf{w}^{(t+1)} \gets \mathbf{w}^{(t)} - 2\alpha \mathbf{X}^T(\mathbf{X}\mathbf{w}^{(t)} - \mathbf{y})\;, \]

where \(\alpha > 0\) is the learning rate.
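As a rough sketch (not a definitive implementation, and not the version you’ll write in your blog post), this iteration could look like the following in numpy; the step size alpha, the number of iterations, and the random initialization are arbitrary illustrative choices, and in practice you would also monitor convergence rather than running a fixed number of steps.

def least_squares_gradient_descent(X, y, alpha = 0.001, max_iter = 1000):
    # random initial guess for the weight vector
    w = np.random.rand(X.shape[1], 1)
    for _ in range(max_iter):
        # gradient of the squared-error loss: 2 X^T (Xw - y)
        grad = 2*X.T@(X@w - y)
        w = w - alpha*grad
    return w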

As it turns out, there’s also an explicit formula involving a matrix inversion, which we can obtain from the condition \(\nabla L(\mathbf{w}) = \mathbf{0}\) that must hold at the minimum. Plugging in our expression for \(\nabla L(\mathbf{w})\) and dividing both sides by 2, we get

\[ \mathbf{0}= \mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y})\;. \]

To start solving for \(\hat{\mathbf{w}}\), we can move \(\mathbf{X}^T\mathbf{y}\) to the other side:

\[ \mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}\;. \]

Now, provided that the matrix \(\mathbf{X}^T\mathbf{X}\) is of full rank, we can multiply both sides by \((\mathbf{X}^T\mathbf{X})^{-1}\) to obtain \[ \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\;, \tag{3}\] which is an explicit formula for \(\hat{\mathbf{w}}\).

This requires that there are at least \(p\) linearly independent rows of \(\mathbf{X}\). In particular, \(\mathbf{X}\) must have at least as many rows as it has columns.

Let’s see if we can use this to compute predictions for our fake data above. In order to include an intercept term in the model, we need to pad \(\mathbf{X}\) with a column of ones.

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

X = pad(x)
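As a quick sanity check that is not strictly required by the derivation, numpy can report the rank of \(\mathbf{X}^T\mathbf{X}\), which should equal the number of columns of \(\mathbf{X}\) for the inverse to exist:

# X^T X is invertible exactly when its rank equals the number of columns of X
np.linalg.matrix_rank(X.T@X) == X.shape[1]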

Now we can use the formula:

w_hat = np.linalg.inv(X.T@X)@X.T@y
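As a side note, explicitly forming the inverse works fine at this scale, but solving the linear system \(\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}\) directly is generally preferred for numerical stability; one equivalent alternative is:

# solve (X^T X) w = X^T y without explicitly forming the inverse
w_hat = np.linalg.solve(X.T@X, X.T@y)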

Let’s test this out on our fake data:

plt.scatter(x, y)

x_fake = np.linspace(0, 1, 101)[:,np.newaxis]
X_fake = pad(x_fake)
plt.plot(x_fake, X_fake@w_hat, color = "black")
labels = plt.gca().set(xlabel = "Feature (x)", ylabel = "Target (y)")

Not bad!

Activity 1: Computational Complexity of Exact Least-Squares Regression

Multiplying a \(k\times \ell\) matrix by an \(\ell \times k\) matrix using the standard algorithm has time complexity \(k^2\ell\). Inverting a \(k\times k\) matrix (when the inverse exists) has time complexity \(k^3\). Multiplying a \(k \times \ell\) matrix by an \(\ell\)-vector has time complexity \(k\ell\). The total time complexity of performing a sequence of these operations is just the sum of the complexities of the individual operations.

With these facts in mind, describe the time complexity of computing \(\hat{\mathbf{w}}\) using Equation 3 in terms of the number of data points \(n\) and the number of features \(p\). What would the computational bottleneck be when the number of data points \(n\) is very large? What about when the number of features \(p\) is very large?

Reminder: \(\mathbf{X}\in \mathbb{R}^{n\times p}\), \(\mathbf{y}\in \mathbb{R}^n\), and the formula in question is \(\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\).

Activity 2: Computational Complexity of Gradient Descent

As you’ll implement in your blog post, gradient descent can actually be carried out via the iteration

\[ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha (\mathbf{P}\mathbf{w}^{(t)} - \mathbf{q})\;, \]

where \(\mathbf{P}\) is a \(p \times p\) matrix and \(\mathbf{q}\) is a \(p\)-vector. What is the time complexity of a single iteration of gradient descent?

Of course, a full analysis of the time complexity of the gradient descent algorithm as a whole requires knowing how many iterations are necessary to achieve acceptable accuracy, which is a much harder problem.

Hint: the trick is to precompute the matrix products \(\mathbf{X}^T\mathbf{X}\) and \(\mathbf{X}^T\mathbf{y}\) once, before the iteration begins.

Scoring Linear Models

A good, simple way to score linear models is by using the loss function directly: smaller values are better! In scikit-learn, regression models are instead scored using an affine transformation of the loss function called the coefficient of determination. The coefficient of determination is 1 when the model fits the data perfectly with no errors. It can be arbitrarily negative (e.g. -8.7) if the model fits the data very poorly.

For a quick illustration, here’s the scikit-learn implementation of linear regression:

from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(x, y)

# training score
LR.score(x, y)
0.8587136683818445

Let’s evaluate our model on similar, unseen data:

n = 100
x_val = np.random.rand(n, 1)
y_val = w1*x_val + w0 + 0.1*np.random.randn(n, 1)
LR.score(x_val, y_val)
0.8290675457730805

As usual, gaps between the training and validation scores suggest the possibility of overfitting, although further investigation is required to see whether improvement on validation data is possible.

Incorporating Features

Sometimes (most of the time), the patterns we are looking for in our data are not actually linear. For example:

x = 2*np.pi*np.random.rand(n, 1)
y = np.sin(x) + 0.2*np.random.randn(n, 1)

plt.scatter(x, y)

Just like in the case of classification, we can use feature maps to learn nonlinear patterns. For example:

from sklearn.preprocessing import PolynomialFeatures

p = PolynomialFeatures(2)
PHI = p.fit_transform(x)
PHI[:5]
array([[1.00000000e+00, 2.38502900e+00, 5.68836335e+00],
       [1.00000000e+00, 3.26013588e+00, 1.06284859e+01],
       [1.00000000e+00, 1.60992471e-01, 2.59185759e-02],
       [1.00000000e+00, 5.25277132e-02, 2.75916065e-03],
       [1.00000000e+00, 4.18447039e+00, 1.75097924e+01]])

The feature matrix \(\Phi\) now plays the role of \(\mathbf{X}\), and we can use the same formula as earlier:

w_hat = np.linalg.inv(PHI.T@PHI)@PHI.T@y
w_hat
array([[ 0.93540932],
       [-0.30331504],
       [ 0.00414563]])

The predictions are

y_hat = PHI@w_hat

Let’s visualize:

plt.scatter(x, y)


x_lin = np.linspace(0, 2*np.pi, 1001)[:,np.newaxis]
PHI_lin = p.fit_transform(x_lin)
y_trend = PHI_lin@w_hat

plt.plot(x_lin, y_trend, color = "black")

Hmmm, this is not a very impressive fit. Let’s wrap this process in a function and do a few experiments.

def poly_viz(deg, ax):
    p = PolynomialFeatures(deg)
    PHI = p.fit_transform(x)
    w_hat = np.linalg.inv(PHI.T@PHI)@PHI.T@y
    x_lin = np.linspace(x.min(), x.max(), 1001)[:,np.newaxis]
    PHI_lin = p.fit_transform(x_lin)
    y_trend = PHI_lin@w_hat
    ax.scatter(x, y)
    ax.plot(x_lin, y_trend, color = "black")
fig, axarr = plt.subplots(2, 5, figsize = (8, 4))

for deg, ax in zip(range(1, 11), axarr.ravel()): 
    poly_viz(deg, ax)
    ax.set(title = f"degree = {deg}")

plt.tight_layout()

As in classification, the use of polynomial features makes us susceptible to overfitting, and validation or cross-validation should be used in order to select a good degree.
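As one possible sketch of this kind of model selection (not the approach required in the blog post), we could use scikit-learn’s cross_val_score on a pipeline combining PolynomialFeatures with LinearRegression; the range of degrees and the number of folds here are arbitrary illustrative choices.

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

for deg in range(1, 11):
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    # mean coefficient of determination across 5 cross-validation folds
    cv_score = cross_val_score(model, x, y, cv = 5).mean()
    print(f"degree = {deg}, mean CV score = {cv_score:.3f}")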

Kernel Regression

Kernel methods offer a theoretically grounded approach for dealing with nonlinearity in both classification and regression. (You can implement a kernel classification method in this blog post.) In kernel methods, the feature vector corresponding to each data point \(\mathbf{x}_i\) is actually defined in terms of all the other data points:

\[ \phi(\mathbf{x}_i) = \left(\begin{matrix} k(\mathbf{x}_i, \mathbf{x}_1) \\ k(\mathbf{x}_i, \mathbf{x}_2) \\ \vdots \\ k(\mathbf{x}_i, \mathbf{x}_n) \\ \end{matrix}\right) \]

Here, \(k:\mathbb{R}^{p} \times \mathbb{R}^{p} \rightarrow \mathbb{R}\) is a special function called a kernel function. Usually the kernel function is a measure of how similar two points are. A very common and useful kernel is the radial basis function (RBF) kernel

\[ k(\mathbf{x}_1, \mathbf{x}_2) = e^{-\gamma \lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert^2}\;. \]

This function is largest when \(\mathbf{x}_1 = \mathbf{x}_2\), and decreases as these two vectors become farther apart. The idea of kernel methods is that the prediction can be expressed as a weighted sum of the target values at nearby points.
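To make the feature map above concrete, here is a minimal numpy sketch of the matrix whose \(i\)-th row is \(\phi(\mathbf{x}_i)\) under the RBF kernel; the function name rbf_kernel_matrix is purely illustrative.

def rbf_kernel_matrix(X1, X2, gamma):
    # entry (i, j) is k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis = -1)
    return np.exp(-gamma*sq_dists)

# each row of PHI_rbf is the kernel feature vector phi(x_i) of one data point
PHI_rbf = rbf_kernel_matrix(x, x, gamma = 1)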

Kernel methods have extremely beautiful mathematics behind them and are fun to implement, but for today we can just show the implementation in scikit-learn:

from sklearn.kernel_ridge import KernelRidge

KR = KernelRidge(kernel = "rbf", gamma = 1)

KR.fit(x, y)

x_lin = np.linspace(x.min(), x.max(), 1001)[:,np.newaxis]
y_lin = KR.predict(x_lin)

plt.scatter(x, y)
plt.plot(x_lin, y_lin, color = "black")

Different values of gamma will result in more or less “wiggly” fits. gamma should be tuned using validation or cross-validation in order to protect against overfitting.
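One way this tuning might look is sketched below using scikit-learn’s GridSearchCV; the grid of gamma values and the number of folds are arbitrary illustrative choices, not recommendations.

from sklearn.model_selection import GridSearchCV

# search over a logarithmic grid of gamma values using 5-fold cross-validation
search = GridSearchCV(KernelRidge(kernel = "rbf"),
                      param_grid = {"gamma": np.logspace(-2, 2, 20)},
                      cv = 5)
search.fit(x, y)
search.best_params_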

Activity

The following code produces a random feature matrix \(\mathbf{X}\), a random weight vector \(\mathbf{w}\), and a corresponding target vector \(\mathbf{y}\):

X = np.random.rand(10, 3)
X = pad(X)
w = np.random.rand(X.shape[1])

y = X@w + np.random.randn(X.shape[0])

  1. Implement a predict function that computes y_hat from X and w (hint: don’t overthink it).
  2. Implement a score function that accepts X, y, and w as arguments and computes the coefficient of determination of the predictions. The coefficient of determination is \[ c = 1 - \frac{\sum_{i = 1}^n (\hat{y}_i - y_i)^2}{\sum_{i = 1}^n (\bar{y} - y_i)^2}\;, \] where \(\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i\).

You can modify these functions and use them in the blog post on linear regression.



© Phil Chodrow, 2023