Implementing Logistic Regression

2024-04-04

The expectations for all blog posts apply!

Introduction

Recently, we introduced the gradient descent algorithm for solving the empirical risk minimization problem. We also calculated the gradient of the loss function for logistic regression.

In this blog post you will:

  1. Implement gradient descent for logistic regression in an object-oriented paradigm.
  2. Implement gradient descent with momentum, a key variant of gradient descent, in order to achieve faster convergence.
  3. Perform experiments to test your implementations.

Part A: Implement Logistic Regression

Getting Started

  1. Please create a new blog post. In addition to the usual .ipynb file, please also create a script called logistic.py in the same directory. This is the file in which you will implement logistic regression.
  2. Then, in your .ipynb notebook file, place the following in a Python code block underneath your metadata block:
%load_ext autoreload
%autoreload 2
from logistic import LogisticRegression, GradientDescentOptimizer

If you complete the optional implementation of Newton’s Method, you’ll also need to import NewtonOptimizer.

Implement LinearModel and LogisticRegression

If you haven’t already, implement the methods of the LinearModel class as described in this warmup. Then, define a new class called LogisticRegression which inherits from LinearModel. This class should have two methods:

  • LogisticRegression.loss(X, y) should compute the empirical risk \(L(\mathbf{w})\) using the logistic loss function \[ \begin{aligned} L(\mathbf{w}) = \frac{1}{n} \sum_{i = 1}^n \left[-y_i \log \sigma(s_i) - (1-y_i)\log (1-\sigma(s_i))\right] \end{aligned} \] The weight vector \(\mathbf{w}\) used for this calculation should be stored as an instance variable of the class.
  • LogisticRegression.grad(X, y) should compute the gradient of the empirical risk \(L(\mathbf{w})\). You can use the formula for the gradient supplied in the lecture notes on gradient descent.
As usual, \(s_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\) and \(\sigma(s) = \frac{1}{1 + e^{-s}}\).

For an M, you can implement LogisticRegression.grad using a for-loop. For an E, your solution should involve no explicit loops. While working on a solution that avoids loops, you might find it useful at some point to convert a tensor v with shape (n,) into a tensor v_ with shape (n, 1). The code v_ = v[:, None] will perform this conversion for you.
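As a rough sketch (not a prescribed solution), the two methods could look something like the code below. This assumes that your LinearModel from the warmup stores the weight vector in an instance variable self.w and exposes the scores \(s_i\) through a method such as score(X); those names are guesses based on one common way to complete the warmup, and the placeholder LinearModel is included only to make the snippet self-contained. The gradient uses the vectorized form \(\nabla L(\mathbf{w}) = \frac{1}{n}\mathbf{X}^T(\sigma(\mathbf{s}) - \mathbf{y})\), which you should check against the formula from the lecture notes.

import torch

class LinearModel:
    # minimal placeholder for the warmup class; your own version may differ
    def __init__(self):
        self.w = None

    def score(self, X):
        # initialize the weights randomly on the first call, then return s = Xw
        if self.w is None:
            self.w = torch.rand(X.size(1))
        return X@self.w

class LogisticRegression(LinearModel):

    def loss(self, X, y):
        # empirical risk under the logistic loss, using the stored weights self.w
        sigma = torch.sigmoid(self.score(X))
        return torch.mean(-y*torch.log(sigma) - (1 - y)*torch.log(1 - sigma))

    def grad(self, X, y):
        # vectorized gradient (1/n) X^T (sigma(s) - y); no explicit loops
        sigma = torch.sigmoid(self.score(X))
        return X.T@(sigma - y) / X.size(0)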

Implement GradientDescentOptimizer

Next, implement a GradientDescentOptimizer class. For this project, we are going to implement gradient descent with momentum, also known as Spicy Gradient Descent. Let \(\mathbf{w}_k\) be the estimate of the weight vector at algorithmic step \(k\). Gradient descent with momentum performs the update \[ \begin{aligned} \mathbf{w}_{k+1} \gets \mathbf{w}_k - \alpha \nabla L(\mathbf{w}_k) + \beta(\mathbf{w}_k - \mathbf{w}_{k-1}) \end{aligned} \tag{1}\]

This description of gradient descent with momentum is based on Hardt and Recht (2022).

Here, \(\alpha\) and \(\beta\) are two learning rate parameters. When \(\beta = 0\) we have “regular” gradient descent. In practice, a choice of \(\beta \approx 0.9\) or so is common. Implementing the momentum term isn’t too complex – you’ll just need to create a new instance variable to hold the previous value of \(\mathbf{w}\) as well as the current one.
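A minimal sketch of the optimizer, assuming the step(X, y, alpha, beta) signature used in the training loop in Part B and a model that stores its weights in self.w (as in the sketch above); the previous iterate is stashed in an instance variable so that the momentum term in Equation 1 can be formed:

class GradientDescentOptimizer:

    def __init__(self, model):
        self.model = model
        self.prev_w = None   # holds w_{k-1} for the momentum term

    def step(self, X, y, alpha = 0.1, beta = 0.9):
        # make sure the model has a weight vector before the first update
        if self.model.w is None:
            self.model.score(X)
        w = self.model.w
        if self.prev_w is None:
            self.prev_w = w.clone()
        # Equation 1: w_{k+1} = w_k - alpha * grad L(w_k) + beta * (w_k - w_{k-1})
        new_w = w - alpha*self.model.grad(X, y) + beta*(w - self.prev_w)
        self.prev_w = w
        self.model.w = new_w

Setting \(\beta = 0\) makes the last term vanish, so the same class also covers vanilla gradient descent.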

Part B: Experiments

Experimental Data

Here is some code to generate data for a classification problem. You can control the number of points by adjusting n_points, the number of features by adjusting p_dims, and the difficulty of the classification problem by adjusting the noise (higher noise is a harder problem).

import torch

def classification_data(n_points = 300, noise = 0.2, p_dims = 2):
    
    y = torch.arange(n_points) >= int(n_points/2)
    y = 1.0*y
    X = y[:, None] + torch.normal(0.0, noise, size = (n_points,p_dims))
    X = torch.cat((X, torch.ones((X.shape[0], 1))), 1)
    
    return X, y

X, y = classification_data(noise = 0.5)
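Before fitting anything, it is worth eyeballing the data. Here is a quick sanity check, assuming the default p_dims = 2 so that the first two columns of X are features and the third is the constant column:

import matplotlib.pyplot as plt

# scatter the two feature columns, colored by the label y
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = "coolwarm", alpha = 0.7)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Synthetic classification data (noise = 0.5)")
plt.show()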

How to Train Your Model

Once you’ve correctly implemented logistic regression and gradient descent, you can run a gradient descent loop like this:

LR = LogisticRegression() 
opt = GradientDescentOptimizer(LR)

for _ in range(100):
    # add other stuff to e.g. keep track of the loss over time. 
    opt.step(X, y, alpha = 0.1, beta = 0.9)
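One possible way to fill in that comment is to record the loss before each step and plot it afterwards; this assumes that LogisticRegression.loss returns a scalar tensor, as in the sketch from Part A:

import matplotlib.pyplot as plt

LR = LogisticRegression()
opt = GradientDescentOptimizer(LR)

loss_history = []
for _ in range(100):
    loss_history.append(LR.loss(X, y).item())   # record the current empirical risk
    opt.step(X, y, alpha = 0.1, beta = 0.9)

plt.plot(loss_history)
plt.xlabel("Iteration")
plt.ylabel("Empirical risk L(w)")
plt.show()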

Experiments

Please perform experiments, with careful written explanations, that demonstrate the following statements:

  1. Vanilla gradient descent: When the number of features is \(p_{\mathrm{dims}} = 2\), \(\alpha\) is sufficiently small, and \(\beta = 0\), gradient descent for logistic regression converges to a weight vector \(\mathbf{w}\) that looks visually correct (plot the decision boundary with the data; a sketch of one possible plotting helper appears after this list). Furthermore, the loss decreases monotonically (plot the loss over iterations).
    • This is a good experiment to use to assess whether your implementation in Part A has bugs.
  2. Benefits of momentum: On the same data, gradient descent with momentum (e.g. \(\beta = 0.9\)) can converge to the correct weight vector in fewer iterations than vanilla gradient descent (with \(\beta = 0\)). Plot the loss over iterations for each method. You may need to experiment with the data and choice of \(\alpha\) in order to observe speedups due to momentum.
  3. Overfitting: Generate some data where p_dims > n_points. For example, p_dims = 100 and n_points = 50. Do this twice with the exact same parameters. Call the first dataset X_train, y_train and the second dataset X_test, y_test. Then, do an experiment in which you fit a logistic regression model to the data X_train, y_train and obtain 100% accuracy on this training data. What is the accuracy on the test data? (A sketch of a simple accuracy helper appears after this list.)
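For Experiments 1 and 3 you will need a way to draw the decision boundary and to compute accuracy. The helpers below are hypothetical sketches rather than required functions: plot_decision_boundary assumes p_dims = 2, so that the boundary is the line \(w_1 x_1 + w_2 x_2 + w_3 = 0\), and accuracy assumes that the model predicts \(\hat{y}_i = 1\) whenever \(s_i > 0\).

import torch
import matplotlib.pyplot as plt

def plot_decision_boundary(X, y, w):
    # scatter the data, then draw the line w_1 x_1 + w_2 x_2 + w_3 = 0
    plt.scatter(X[:, 0], X[:, 1], c = y, cmap = "coolwarm", alpha = 0.7)
    x1 = torch.linspace(X[:, 0].min().item(), X[:, 0].max().item(), 100)
    x2 = -(w[0]*x1 + w[2]) / w[1]
    plt.plot(x1, x2, color = "black", label = "decision boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.legend()
    plt.show()

def accuracy(model, X, y):
    # fraction of points whose predicted label matches the true label
    y_hat = 1.0*(model.score(X) > 0)
    return (y_hat == y).float().mean().item()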

Part C: Writing

Please:

  1. Include informative comments throughout your source code and a thorough docstring for each method in LogisticRegression and GradientDescentOptimizer.
  2. Add careful expository writing throughout your blog post. Describe each experiment and what it is intended to illustrate, and make sure that all of your plots are legible and have appropriate axis labels and legends.
  3. At the beginning of your blog post, place a link to your source code logistic.py on GitHub. After this link, write an abstract paragraph describing the topic of your post and giving a brief overview of the experiments you performed.
  4. At the conclusion of your blog post, write a discussion paragraph reminding the reader of what you did and what you learned while doing it.



© Phil Chodrow, 2024

References

Hardt, Moritz, and Benjamin Recht. 2022. Patterns, Predictions, and Actions. Princeton University Press.