Implementing Linear Regression

In this blog post, you’ll implement least-squares linear regression, and experiment with LASSO regularization for overparameterized problems.

Published

March 15, 2023

1 Implement Linear Regression Two Ways

To start this blog post, please implement least-squares linear regression in two ways.

  1. First, use the analytical formula for the optimal weight vector \(\hat{\mathbf{w}}\) from the lecture notes. This formula requires matrix inversion and several matrix multiplications.
  2. Next, use the formula for the gradient of the loss function to implement gradient descent for linear regression. You can still pass in max_iter and alpha (the learning rate) as parameters to the fit method.
Implementing stochastic gradient descent would be a nice thing to do, but it is not required; attempt it only if you want to go above and beyond!

In addition to the fit method, your implementation should include a score method (see below) and a predict method (just return \(\mathbf{X}\mathbf{w}\)).

It’s fine to define separate methods like fit_analytic and fit_gradient for the two algorithms. It’s also fine to define a single fit method with a method argument that determines which algorithm is used.

As usual, place your implementation in a source file from which you will be able to import it.
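
If it helps to see the overall shape, here is a minimal sketch of the analytical part of such a class. This is only one possible design (the class layout, padding convention, and method names are all up to you), not a required template:

import numpy as np

class LinearRegression:
    # minimal sketch of the analytical least-squares fit

    def pad(self, X):
        # append a constant column of 1s so the last weight acts as an intercept
        return np.append(X, np.ones((X.shape[0], 1)), 1)

    def fit_analytic(self, X, y):
        X_ = self.pad(X)
        # w-hat = (X^T X)^{-1} X^T y, the formula from the lecture notes
        # (np.linalg.solve(X_.T@X_, X_.T@y) is an equivalent, more numerically stable alternative)
        self.w = np.linalg.inv(X_.T @ X_) @ X_.T @ y

    def predict(self, X):
        # return Xw, padding X so that its shape matches the fitted weights
        return self.pad(X) @ self.w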

The Score

Let \(\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i\). Then, the score we’ll use is the so-called coefficient of determination, which is

\[ c = 1 - \frac{\sum_{i = 1}^n (\hat{y}_i - y_i)^2}{\sum_{i = 1}^n (\bar{y} - y_i)^2}\;. \]

The coefficient of determination is always no larger than 1, with a higher value indicating better predictive performance. It can be arbitrarily negative for very bad models. Note that the numerator in the fraction is just \(L(\mathbf{w})\), so making the loss small makes the coefficient of determination large.
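
Continuing the sketch above, a score method computing this quantity might look roughly like this:

    def score(self, X, y):
        # coefficient of determination: 1 - (sum of squared errors) / (squared deviations from the mean)
        y_hat = self.predict(X)
        sse = np.sum((y_hat - y) ** 2)     # the numerator: this is just the loss L(w)
        sst = np.sum((y.mean() - y) ** 2)
        return 1 - sse / sst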

Efficient Gradient Descent

For gradient descent, please also maintain a score_history so that you can visualize the value of the score over the iterations.

The formula for the gradient is \[ \nabla L(\mathbf{w}) = 2\mathbf{X}^T(\mathbf{X}\mathbf{w}- \mathbf{y})\;. \] However, you should resist the urge to compute this formula “from scratch” at every iteration. The reason is that the matrix multiplication \(\mathbf{X}^T\mathbf{X}\) has time-complexity \(O(np^2)\), where \(n\) is the number of data points and \(p\) is the number of features. Similarly, the matrix-vector product \(\mathbf{X}^T\mathbf{y}\) has time-complexity \(O(np)\). Both of these can be pretty expensive if you have a lot of data points! Fortunately, they don’t depend on the current value of \(\mathbf{w}\), so you can actually just precompute them:

  1. Once during the fit method, compute \(\mathbf{P}= \mathbf{X}^T \mathbf{X}\) and \(\mathbf{q}= \mathbf{X}^T \mathbf{y}\).
  2. The gradient is then \(\nabla L(\mathbf{w}) = 2(\mathbf{P}\mathbf{w}- \mathbf{q})\).

Computing \(\mathbf{P}\mathbf{w}\) requires only \(O(p^2)\) steps. In other words, precomputing \(\mathbf{P}\) and \(\mathbf{q}\) eliminates the dependence of the runtime on the number of data points – not bad!
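
In code, the precomputation might look roughly like the following (again just a sketch continuing the class above; initialization, convergence checks, and the exact signature are up to you):

    def fit_gradient(self, X, y, alpha = 0.01, max_iter = 1000):
        X_ = self.pad(X)
        n, p = X_.shape

        # precompute once: these are the only data-dependent pieces of the gradient
        P = X_.T @ X_    # p x p matrix
        q = X_.T @ y     # length-p vector

        self.w = np.random.rand(p)
        self.score_history = []

        for _ in range(int(max_iter)):
            grad = 2 * (P @ self.w - q)   # O(p^2) per iteration: no dependence on n
            self.w -= alpha * grad
            self.score_history.append(self.score(X, y))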

2 Demo

The following function will create both training and validation data that you can use to test your implementation:

import numpy as np
from matplotlib import pyplot as plt

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

def LR_data(n_train = 100, n_val = 100, p_features = 1, noise = .1, w = None):
    if w is None: 
        w = np.random.rand(p_features + 1) + .2
    
    X_train = np.random.rand(n_train, p_features)
    y_train = pad(X_train)@w + noise*np.random.randn(n_train)

    X_val = np.random.rand(n_val, p_features)
    y_val = pad(X_val)@w + noise*np.random.randn(n_val)
    
    return X_train, y_train, X_val, y_val

Here’s an example of how to use the function to generate data. Unfortunately, it’s only possible to easily visualize this problem when p_features = 1.

n_train = 100
n_val = 100
p_features = 1
noise = 0.2

# create some data
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)

# plot it
fig, axarr = plt.subplots(1, 2, sharex = True, sharey = True)
axarr[0].scatter(X_train, y_train)
axarr[1].scatter(X_val, y_val)
labs = axarr[0].set(title = "Training", xlabel = "x", ylabel = "y")
labs = axarr[1].set(title = "Validation", xlabel = "x")
plt.tight_layout()

Once you’ve implemented your solution, you should be able to use it like this:

from solutions.linear_regression import LinearRegression

LR = LinearRegression()
LR.fit(X_train, y_train) # I used the analytical formula as my default fit method

print(f"Training score = {LR.score(X_train, y_train).round(4)}")
print(f"Validation score = {LR.score(X_val, y_val).round(4)}")
Training score = 0.3562
Validation score = 0.5122

The estimated weight vector \(\mathbf{w}\) is

LR.w
array([0.53369045, 0.33787198])

I can get the same value for \(\mathbf{w}\) using gradient descent (it would be even closer if we allowed more iterations).

LR2 = LinearRegression()

LR2.fit(X_train, y_train, method = "gradient", alpha = 0.01, max_iter = 1e2)
LR2.w
array([0.53504079, 0.33705831])

I can also see how the score changed over time. Because we’re not using stochastic gradient descent, the score should increase monotonically with each iteration (provided the learning rate is small enough).

plt.plot(LR2.score_history)
labels = plt.gca().set(xlabel = "Iteration", ylabel = "Score")

Your implementation is likely correct when you are able to reproduce results that are similar to these (although small differences are no problem).

3 Experiments

Once you’ve demonstrated the behavior above, perform an experiment in which you allow p_features, the number of features used, to increase, while holding n_train, the number of training points, constant. Try increasing p_features all the way to n_train - 1. What happens to the training score? What happens to the validation score? I’d suggest showing these results on a nice plot in which the horizontal axis is the number of features, the vertical axis is the score, and the training/validation scores are shown in different colors.
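
One possible skeleton for this experiment is sketched below; it assumes your LinearRegression class and the LR_data function from the demo above:

n_train = 100
n_val = 100
noise = 0.2

p_range = np.arange(1, n_train)   # 1, 2, ..., n_train - 1
train_scores = []
val_scores = []

for p in p_range:
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p, noise)
    LR = LinearRegression()
    LR.fit(X_train, y_train)
    train_scores.append(LR.score(X_train, y_train))
    val_scores.append(LR.score(X_val, y_val))

plt.plot(p_range, train_scores, label = "training")
plt.plot(p_range, val_scores, label = "validation")
labels = plt.gca().set(xlabel = "Number of features", ylabel = "Score")
plt.legend()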

Optional: Relate your findings when p_features = n_train-1 to the existence of a solution of the equation \(\mathbf{X}\mathbf{w}= \mathbf{y}\). What do you know about the rank of \(\mathbf{X}\)? Remember that the number of columns in \(\mathbf{X}\) is actually p_features + 1, since it’s still necessary to pad with a column of 1s.

When discussing your findings, make sure to connect them to the idea of overfitting.

4 LASSO Regularization

The LASSO algorithm uses a modified loss function with a regularization term:

\[ L(\mathbf{w}) = \lVert \mathbf{X}\mathbf{w}- \mathbf{y} \rVert_2^2 + \alpha \lVert \mathbf{w}' \rVert_1\;. \]

Here, \(\mathbf{w}'\) is the vector composed of all the entries of \(\mathbf{w}\) excluding the very last entry. The 1-norm is defined as

\[ \lVert \mathbf{w}' \rVert_1 = \sum_{j = 1}^{p-1} \lvert w_j \rvert\;. \]

The effect of the regularizing term is to make the entries of the weight vector \(\mathbf{w}\) small. In fact, LASSO has a nice property: it tends to force entries of the weight vector to be exactly zero. This is very desirable in so-called overparameterized problems, when the number of features \(p\) is larger than the number of data points \(n\).

The reason we exclude the final entry of \(\mathbf{w}\) is that it is not desirable to penalize the weight corresponding to the constant feature in \(\mathbf{X}\).

Implementing LASSO involves some more complicated mathematical optimization than we will discuss in this course, so instead we’ll use the implementation in scikit-learn. You can import it like this:

from sklearn.linear_model import Lasso
L = Lasso(alpha = 0.001)

Here, alpha controls the strength of the regularization (it’s not related to the learning rate in gradient descent). Let’s fit this model on some data and check the coefficients:

p_features = n_train - 1
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L.fit(X_train, y_train)
Lasso(alpha=0.001)

The score on the validation set is fairly high, which might be quite different from what you found with pure linear regression when the number of features was this large.

L.score(X_val, y_val)
0.541222047393935
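
If you’d like to see the sparsity property from the previous section in action, you can also check how many of the fitted coefficients were pushed to exactly zero (L.coef_ holds the fitted weights, and L.intercept_ the unpenalized intercept):

print(f"{np.sum(L.coef_ == 0)} of {L.coef_.size} coefficients are exactly zero")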

What You Should Do

Replicate the same experiment you did with linear regression, increasing the number of features up to or even past n_train - 1, using LASSO instead of linear regression. You might want to experiment with a few values of the regularization strength alpha. Comment on how your validation score compares to standard linear regression when the number of features used is large.
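
A possible skeleton for this experiment, mirroring the loop from Section 3 but swapping in Lasso (the alpha values shown are just starting guesses):

p_range = np.arange(1, 2 * n_train)   # LASSO can go past n_train - 1 without trouble
val_scores = {}

for alpha in [0.0001, 0.001, 0.01]:
    val_scores[alpha] = []
    for p in p_range:
        X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p, noise)
        L = Lasso(alpha = alpha)
        L.fit(X_train, y_train)
        val_scores[alpha].append(L.score(X_val, y_val))

for alpha, scores in val_scores.items():
    plt.plot(p_range, scores, label = f"alpha = {alpha}")
labels = plt.gca().set(xlabel = "Number of features", ylabel = "Validation score")
plt.legend()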

5 Optional: Bikeshare Data Set

The following code will download a data set related to the Capital Bikeshare system in Washington DC. We use the aggregated version graciously provided by the authors of the following paper:

Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

This data set includes information about the season and time of year; the weather; and the count of bicycle users on each day for two years (year 0 is 2011, year 1 is 2012). This level of information gives us considerable ability to model phenomena in the data.

For more on what the entries in each column mean, you can consult the data dictionary here (under “Attribute Information”).

import pandas as pd
from sklearn.model_selection import train_test_split
bikeshare = pd.read_csv("https://philchodrow.github.io/PIC16A/datasets/Bike-Sharing-Dataset/day.csv")

bikeshare.head()
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.200000 0.212122 0.590435 0.160296 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.226957 0.229270 0.436957 0.186900 82 1518 1600

Our aim for this case study is to model daily usage by casual users (as opposed to registered users). The total number of casual users each day is given by the casual column. Let’s plot this over time:

# import datetime
fig, ax = plt.subplots(1, figsize = (7, 3))
ax.plot(pd.to_datetime(bikeshare['dteday']), bikeshare['casual'])
ax.set(xlabel = "Day", ylabel = "# of casual users")
l = plt.tight_layout()

For this prediction task, it’s handy to work with a smaller subset of the columns, and to transform the mnth column into dummy variables.

cols = ["casual", 
        "mnth", 
        "weathersit", 
        "workingday",
        "yr",
        "temp", 
        "hum", 
        "windspeed",
        "holiday"]

bikeshare = bikeshare[cols]

bikeshare = pd.get_dummies(bikeshare, columns = ['mnth'], drop_first = True)
bikeshare
casual weathersit workingday yr temp hum windspeed holiday mnth_2 mnth_3 mnth_4 mnth_5 mnth_6 mnth_7 mnth_8 mnth_9 mnth_10 mnth_11 mnth_12
0 331 2 0 0 0.344167 0.805833 0.160446 0 0 0 0 0 0 0 0 0 0 0 0
1 131 2 0 0 0.363478 0.696087 0.248539 0 0 0 0 0 0 0 0 0 0 0 0
2 120 1 1 0 0.196364 0.437273 0.248309 0 0 0 0 0 0 0 0 0 0 0 0
3 108 1 1 0 0.200000 0.590435 0.160296 0 0 0 0 0 0 0 0 0 0 0 0
4 82 1 1 0 0.226957 0.436957 0.186900 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
726 247 2 1 1 0.254167 0.652917 0.350133 0 0 0 0 0 0 0 0 0 0 0 1
727 644 2 1 1 0.253333 0.590000 0.155471 0 0 0 0 0 0 0 0 0 0 0 1
728 159 2 0 1 0.253333 0.752917 0.124383 0 0 0 0 0 0 0 0 0 0 0 1
729 364 1 0 1 0.255833 0.483333 0.350754 0 0 0 0 0 0 0 0 0 0 0 1
730 439 2 1 1 0.215833 0.577500 0.154846 0 0 0 0 0 0 0 0 0 0 0 1

731 rows × 19 columns

Now we can do a train-test split.

train, test = train_test_split(bikeshare, test_size = .2, shuffle = False)

X_train = train.drop(["casual"], axis = 1)
y_train = train["casual"]

X_test = test.drop(["casual"], axis = 1)
y_test = test["casual"]

Train an instance of your LinearRegression class on the bikeshare training data. Then (a rough sketch of these steps appears after this list):

  • Score your model on the test set.
  • Compute the predictions for each day and visualize them in comparison to the actual ridership on the test set.
  • Compare the entries of your model’s weight vector w to the corresponding entries of X_train.columns in order to see which features your model found to contribute to ridership. Positive coefficients suggest that the corresponding feature increases ridership. Can you find effects corresponding to nice weather? Summer months? Holidays? Weekends?
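
Here is a rough sketch of how these steps might look. I’m assuming a class that behaves like the one in the demo (in particular, that the last entry of LR.w corresponds to the constant column); depending on your implementation, you may or may not need to convert the data frames to plain float arrays first:

from solutions.linear_regression import LinearRegression

Xtr, ytr = X_train.values.astype(float), y_train.values.astype(float)
Xte, yte = X_test.values.astype(float), y_test.values.astype(float)

LR = LinearRegression()
LR.fit(Xtr, ytr)
print(f"Test score = {LR.score(Xte, yte)}")

# compare predictions to actual ridership on the test set
preds = LR.predict(Xte)
fig, ax = plt.subplots(1, figsize = (7, 3))
ax.plot(yte, label = "actual")
ax.plot(preds, label = "predicted")
ax.set(xlabel = "Day (position in test set)", ylabel = "# of casual users")
ax.legend()

# line up the fitted weights with the feature names to see which features matter
coefs = pd.DataFrame({"feature": X_train.columns, "weight": LR.w[:len(X_train.columns)]})
print(coefs.sort_values("weight", ascending = False))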



© Phil Chodrow, 2023