```python
import numpy as np
from matplotlib import pyplot as plt

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

def LR_data(n_train = 100, n_val = 100, p_features = 1, noise = .1, w = None):
    if w is None:
        w = np.random.rand(p_features + 1) + .2

    X_train = np.random.rand(n_train, p_features)
    y_train = pad(X_train)@w + noise*np.random.randn(n_train)

    X_val = np.random.rand(n_val, p_features)
    y_val = pad(X_val)@w + noise*np.random.randn(n_val)

    return X_train, y_train, X_val, y_val
```
Implementing Linear Regression
In this blog post, you’ll implement least-squares linear regression, and experiment with LASSO regularization for overparameterized problems.
1 Implement Linear Regression Two Ways
To start this blog post, please implement least-squares linear regression in two ways.
- First, use the analytical formula for the optimal weight vector \(\hat{\mathbf{w}}\) from the lecture notes. This formula requires matrix inversion and several matrix multiplications.
- Next, use the formula for the gradient of the loss function to implement gradient descent for linear regression. You can still pass in `max_iter` and `alpha` (the learning rate) as parameters to the `fit` method.
In addition to the `fit` method, your implementation should include a `score` method (see below) and a `predict` method (just return \(\mathbf{X}\mathbf{w}\)).
It's fine to define separate methods like `fit_analytic` and `fit_gradient` for the two algorithms. It's also fine to define a single `fit` method with a `method` argument that determines which algorithm is used.
As usual, place your implementation in a source file from which you can import it.
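To make the expected interface concrete, here is a minimal sketch of how the analytical fit might be organized. The names `LinearRegression` and `fit_analytic` appear elsewhere in this post; padding inside the class and the use of `np.linalg.inv` are my own illustrative choices, not a required design.

```python
import numpy as np

def pad(X):
    # append a column of 1s so that the weight vector includes an intercept
    return np.append(X, np.ones((X.shape[0], 1)), 1)

class LinearRegression:
    # minimal sketch, not a reference solution

    def fit_analytic(self, X, y):
        # analytical formula from lecture: w-hat = (X^T X)^{-1} X^T y
        X_ = pad(X)
        self.w = np.linalg.inv(X_.T@X_)@X_.T@y

    def predict(self, X):
        # return Xw, padding X to match the fitted weight vector
        return pad(X)@self.w
```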
The Score
Let \(\bar{y} = \frac{1}{n} \sum_{i = 1}^ny_i\). Then, the score we’ll use is the so-called coefficient of determination, which is
\[ c = 1 - \frac{\sum_{i = 1}^n (\hat{y}_i - y_i)^2}{\sum_{i = 1}^n (\bar{y} - y_i)^2}\;. \]
The coefficient of determination is always no larger than 1, with a higher value indicating better predictive performance. It can be arbitrarily negative for very bad models. Note that the numerator in the fraction is just \(L(\mathbf{w})\), so making the loss small makes the coefficient of determination large.
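For concreteness, the score can be computed in a couple of lines directly from the formula above. The helper name below is mine; inside your class, \(\hat{y}\) would presumably come from `self.predict(X)`.

```python
import numpy as np

def coef_of_determination(y_hat, y):
    # 1 minus (sum of squared residuals) / (sum of squared deviations from the mean)
    return 1 - np.sum((y_hat - y)**2) / np.sum((np.mean(y) - y)**2)
```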
Efficient Gradient Descent
For gradient descent, please implement a `score_history` so that you can visualize the value of the score over epochs.
The formula for the gradient is \[ \nabla L(\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}- \mathbf{y})\;. \] However, you should resist the urge to compute this formula “from scratch” at every iteration. The reason is that the matrix multiplication \(\mathbf{X}^T\mathbf{X}\) has time-complexity \(O(np^2)\), where \(n\) is the number of data points and \(p\) is the number of features. Similarly, the matrix-vector product \(\mathbf{X}^T\mathbf{y}\) has time-complexity \(O(np)\). Both of these can be pretty expensive if you have a lot of data points! Fortunately, they don’t depend on the current value of \(\mathbf{w}\), so you can actually just precompute them:
- Once, during the `fit` method, compute \(\mathbf{P}= \mathbf{X}^T \mathbf{X}\) and \(\mathbf{q}= \mathbf{X}^T \mathbf{y}\).
- The gradient is then \(\nabla L(\mathbf{w}) = 2(\mathbf{P}\mathbf{w}- \mathbf{q})\).
Computing \(\mathbf{P}\mathbf{w}\) requires only \(O(p^2)\) steps. In other words, precomputing \(\mathbf{P}\) and \(\mathbf{q}\) eliminates the dependence of the runtime on the number of data points – not bad!
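Putting the pieces together, a gradient-descent fit might look something like the sketch below, intended to live inside a class like the one sketched earlier. The random initialization of \(\mathbf{w}\), simply running for `max_iter` iterations rather than checking convergence, and the assumption that a `score` method exists are all illustrative choices, not requirements.

```python
def fit_gradient(self, X, y, alpha = 0.01, max_iter = 1000):
    # sketch: gradient descent with precomputed P and q
    X_ = pad(X)
    P = X_.T@X_                           # O(n p^2), computed once
    q = X_.T@y                            # O(n p),   computed once
    self.w = np.random.rand(X_.shape[1])  # random initial guess
    self.score_history = []
    for _ in range(int(max_iter)):
        grad = 2*(P@self.w - q)           # O(p^2) per iteration; no dependence on n
        self.w -= alpha*grad
        self.score_history.append(self.score(X, y))
```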
2 Demo
The `LR_data` function defined at the top of this post will create both training and validation data that you can use to test your implementation. Here's an example of how to use it to generate data. Unfortunately, it's only possible to easily visualize this problem when `p_features = 1`.
```python
n_train = 100
n_val = 100
p_features = 1
noise = 0.2

# create some data
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)

# plot it
fig, axarr = plt.subplots(1, 2, sharex = True, sharey = True)
axarr[0].scatter(X_train, y_train)
axarr[1].scatter(X_val, y_val)
labs = axarr[0].set(title = "Training", xlabel = "x", ylabel = "y")
labs = axarr[1].set(title = "Validation", xlabel = "x")
plt.tight_layout()
```
Once you’ve implemented your solution, you should be able to use it like this:
```python
from solutions.linear_regression import LinearRegression

LR = LinearRegression()

# I used the analytical formula as my default fit method
LR.fit(X_train, y_train)

print(f"Training score = {LR.score(X_train, y_train).round(4)}")
print(f"Validation score = {LR.score(X_val, y_val).round(4)}")
```

```
Training score = 0.3562
Validation score = 0.5122
```
The estimated weight vector \(\mathbf{w}\) is
```python
LR.w
```

```
array([0.53369045, 0.33787198])
```
I can get essentially the same value for \(\mathbf{w}\) using gradient descent (it would be even closer if we allowed more iterations).
```python
LR2 = LinearRegression()
LR2.fit(X_train, y_train, method = "gradient", alpha = 0.01, max_iter = 1e2)
LR2.w
```

```
array([0.53504079, 0.33705831])
```
I can also see how the score changed over time. Because we’re not using stochastic gradient descent, the score should increase monotonically in each iteration.
```python
plt.plot(LR2.score_history)
labels = plt.gca().set(xlabel = "Iteration", ylabel = "Score")
```
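A quick way to sanity-check the monotonicity claim, assuming `score_history` is stored as an ordinary list or array:

```python
# True if every score is at least as large as the one before it
np.all(np.diff(LR2.score_history) >= 0)
```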
Your implementation is likely correct when you are able to reproduce results that are similar to these (although small differences are no problem).
3 Experiments
Once you’ve demonstrated the behavior above, perform an experiment in which you allow `p_features`, the number of features used, to increase, while holding `n_train`, the number of training points, constant. Try increasing `p_features` all the way to `n_train - 1`. What happens to the training score? What happens to the validation score? I’d suggest showing these results on a nice plot in which the horizontal axis is the number of features, the vertical axis is the score, and the training/validation scores are shown in different colors.
Hint: relate the case `p_features = n_train - 1` to the existence of a solution of the equation \(\mathbf{X}\mathbf{w}= \mathbf{y}\). What do you know about the rank of \(\mathbf{X}\)? Remember that the number of columns in \(\mathbf{X}\) is actually `p_features + 1`, since it’s still necessary to pad with a column of 1s.

When discussing your findings, make sure to connect them to the idea of overfitting.
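Here is a sketch of how such an experiment might be set up, assuming the `LR_data` generator from the top of this post and the `LinearRegression` interface from the demo above; the loop structure and plotting choices are just one option.

```python
n_train = 100
n_val = 100
noise = 0.2

p_values = range(1, n_train)   # goes up to n_train - 1
train_scores = []
val_scores = []

for p_features in p_values:
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
    LR = LinearRegression()
    LR.fit(X_train, y_train)
    train_scores.append(LR.score(X_train, y_train))
    val_scores.append(LR.score(X_val, y_val))

plt.plot(p_values, train_scores, label = "Training")
plt.plot(p_values, val_scores, label = "Validation")
labels = plt.gca().set(xlabel = "Number of features", ylabel = "Score")
plt.legend()
```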
4 LASSO Regularization
The LASSO algorithm uses a modified loss function with a regularization term:
\[ L(\mathbf{w}) = \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2 + \alpha \lVert \mathbf{w}' \rVert_1\;. \]
Here, \(\mathbf{w}'\) is the vector composed of all the entries of \(\mathbf{w}\) excluding the very last entry. The 1-norm is defined as
\[ \lVert \mathbf{w}' \rVert_1 = \sum_{j = 1}^{p-1} \lvert w_j \rvert\;. \]
The effect of the regularizing term is to make the entries of the weight vector \(\mathbf{w}\) small. In fact, LASSO has a nice property: it tends to force entries of the weight vector to be exactly zero. This is very desirable in so-called overparameterized problems, when the number of features \(p\) is larger than the number of data points \(n\).
Implementing LASSO involves some more complicated mathematical optimization than we will discuss in this course, so instead we'll use the implementation in `scikit-learn`. You can import it like this:
```python
from sklearn.linear_model import Lasso
L = Lasso(alpha = 0.001)
```
Here, `alpha` controls the strength of the regularization (it's not related to the learning rate in gradient descent). Let's fit this model on some data and check the coefficients:
```python
p_features = n_train - 1
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L.fit(X_train, y_train)
```
```
Lasso(alpha=0.001)
```
The score on the validation set is high, which might be different from what you found with pure linear regression.
```python
L.score(X_val, y_val)
```

```
0.541222047393935
```
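Since LASSO tends to zero out many coefficients, it's also worth inspecting the fitted weights; `coef_` is scikit-learn's attribute for the learned coefficients (excluding the intercept):

```python
# count how many of the coefficients were driven exactly to zero
(L.coef_ == 0).sum()
```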
What You Should Do
Replicate the same experiment you did with linear regression, increasing the number of features up to or even past `n_train - 1`, but using LASSO instead of linear regression. You might want to experiment with a few values of the regularization strength `alpha`. Comment on how your validation score compares to standard linear regression when the number of features used is large.
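One possible way to structure the comparison across regularization strengths, continuing the demo above (the specific `alpha` values are arbitrary choices, and scikit-learn may emit convergence warnings for very small ones):

```python
for alpha in [1e-4, 1e-3, 1e-2]:
    val_scores = []
    for p_features in range(1, n_train):
        X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
        L = Lasso(alpha = alpha)
        L.fit(X_train, y_train)
        val_scores.append(L.score(X_val, y_val))
    plt.plot(range(1, n_train), val_scores, label = f"alpha = {alpha}")

labels = plt.gca().set(xlabel = "Number of features", ylabel = "Validation score")
plt.legend()
```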
© Phil Chodrow, 2023