Convex Linear Models and Logistic Regression

Author

Phil Chodrow

Quick Recap

Last time, we studied the perceptron algorithm for binary classification using hyperplanes. In doing so, we introduced the loss of a hyperplane, which we defined in terms of the misclassifications made by the classifier based on that hyperplane.

We also saw that the perceptron has some major challenges associated with it. In this lecture, we’re going to extend the idea of loss to cover a broader range of models. Within this theory, we’ll be able to understand where some of the perceptron’s problems come from, and what to do about them.

Recall that our setup for the perceptron was as follows. We have data, a pair \((\mathbf{X}, \mathbf{y})\) where

  • \(\mathbf{X}\in \mathbb{R}^{n\times p}\) is the feature matrix. There are \(n\) distinct observations, encoded as rows. Each of the \(p\) columns corresponds to a feature: something about each observation that we can measure or infer. The observations are written \(\mathbf{x}_1, \mathbf{x}_2,\ldots,\mathbf{x}_n\). \[ \mathbf{X}= \left[\begin{matrix} - & \mathbf{x}_1 & - \\ - & \mathbf{x}_2 & - \\ & \vdots & \\ - & \mathbf{x}_{n} & - \end{matrix}\right] \]
  • \(\mathbf{y}\in \mathbb{R}^{n}\) is the target vector. The target vector gives a label, value, or outcome for each observation.

In the perceptron, we assumed that \(\mathbf{y}\in \{-1, 1\}^n\). We also assumed that we are going to try to linearly classify the points by finding a pair \((\mathbf{w}, b)\) that defines a hyperplane. This is the set of points \(\mathbf{x}\in \mathbb{R}^p\) that satisfy the equation \[ \langle \mathbf{w}, \mathbf{x} \rangle - b = 0\;. \]

We saw that if we redefined \(\tilde{\mathbf{x}} = (\mathbf{x}, 1)\) and \(\tilde{\mathbf{w}} = (\mathbf{w}, -b)\), we could simply write this as \[ \langle \tilde{\mathbf{w}}, \tilde{\mathbf{x}} \rangle = 0\; \]

instead. For the remainder of these notes, we’ll simply write our feature vectors as \(\mathbf{x}\) and our parameter vector as \(\mathbf{w}\), assuming that the final entry of \(\mathbf{x}\) is also a 1.

  • For a given point \(\mathbf{x}\), we can make a prediction \(\hat{y} = \langle \mathbf{w}, \mathbf{x} \rangle\).
  • We decide that a prediction is accurate (and give ourselves one “point”) if \(\hat{y}\) has the same sign as \(y\). This can be expressed in either of the following equivalent ways:
    • \(\mathbb{1}\left[ \mathrm{sign}(\hat{y}) = y \right]\)
    • \(\mathbb{1}\left[ \hat{y}y > 0 \right]\)
  • The overall accuracy or score is the fraction of correct predictions across the entire data set. We also defined the overall loss to be one minus the accuracy (a short code sketch of both quantities follows this list): \[ \begin{aligned} A(\mathbf{w}) &= \frac{1}{n}\sum_{i = 1}^n \mathbb{1}\left[ \hat{y}_iy_i > 0 \right]\\ &= \frac{1}{n}\sum_{i = 1}^n \mathbb{1}\left[ (\langle \mathbf{w}, \mathbf{x}_i \rangle)y_i > 0 \right] \\ L(\mathbf{w}) &= 1 - A(\mathbf{w}) \\ &= \frac{1}{n}\sum_{i = 1}^n \left(1 - \mathbb{1}\left[ \hat{y}_iy_i > 0 \right]\right) \\ &= \frac{1}{n}\sum_{i = 1}^n \left(1- \mathbb{1}\left[ (\langle \mathbf{w}, \mathbf{x}_i \rangle)y_i > 0 \right]\right)\;. \end{aligned} \]
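
To make these formulas concrete, here is a minimal numpy sketch (an illustration, not part of the perceptron implementation) of the score and loss, assuming that \(\mathbf{X}\) already includes the constant column of 1s and that \(\mathbf{y}\in \{-1,1\}^n\):

import numpy as np

def zero_one_accuracy(X, y, w):
    y_hat = X@w                  # y_hat_i = <w, x_i> for all observations at once
    return np.mean(y_hat*y > 0)  # a prediction is correct when y_hat and y share a sign

def zero_one_loss(X, y, w):
    return 1 - zero_one_accuracy(X, y, w)  # L(w) = 1 - A(w)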

We’d like to find \(\mathbf{w}\) to minimize the loss function. That is, we’d like to solve the problem

\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \; L(\mathbf{w}) \tag{1}\]

The loss is also often called the empirical risk, and this minimization problem is often called empirical risk minimization, for reasons that we’ll discuss in a coming lecture. The perceptron algorithm was one way to attack the empirical risk minimization problem.

Some Questions For Empirical Risk Minimization

It is at this point that we need to ask some important questions with awkward answers.

  1. Existence: Does Equation 1 have any solutions?
  2. Uniqueness: Assuming there exists a solution to Equation 1, is it unique? Or are there many different solutions?
  3. Searchability: Is it possible to write algorithms that are guaranteed to find a solution of Equation 1?
  4. Performance: Is it possible to make these algorithms fast?

In most prediction problems, what we’d really like is to be right about the true value of \(y\). In the context of linear classifiers, this means that we want all the points of one label to be on one side of the line, and all the points of the other label to be on the other side. The loss function that expresses this idea is the 0-1 loss function, which is, again, the loss function used in the perceptron algorithm: \[ \ell(\hat{y}, y) = 1 - \mathbb{1}\left[ \hat{y}y > 0 \right]\;. \] When we graph the 0-1 loss function, it looks like this. I’ve shown versions corresponding to both values of the true label, \(y = 1\) and \(y = -1\).

Code
from matplotlib import pyplot as plt 
import numpy as np

plt.rcParams["figure.figsize"] = (10, 4)

fig, axarr = plt.subplots(1, 2) 
y_hat = np.linspace(-1, 1, 101)

loss = lambda y_hat, y: 1 - 1*(y_hat*y > 0)

for j in range(2):
    y = [-1, 1][j]
    axarr[j].set_title(f"y = {y}")
    axarr[j].set(xlabel = r"$\hat{y}$", 
                 ylabel = r"$\ell(\hat{y}, y)$")
    
    axarr[j].plot(y_hat, loss(y_hat, y))

plt.tight_layout()

Let’s now ask our four questions about empirical risk minimization for the 0-1 loss function.

Existence: Does Equation 1 have any solutions?

  • We are good on this one! Specifically, the risk can take on only a finite number of possible values (values between 0 and 1 in increments of \(1/n\)). In any given problem there is a smallest such value attained, and any \(\mathbf{w}\) that attains it is a solution.

Uniqueness: Assuming there exists a solution to Equation 1, is it unique? Or are there many different solutions?

  • Unfortunately, the solution to Equation 1 for the 0-1 loss is almost never unique. This is because, if you have one solution \(\mathbf{w}\), you can “usually” jiggle it by “just a little bit” and still have a minimizing solution. (To define “usually” and “just a little bit” rigorously, we would need to specify a data-generating distribution and do some math.)

Searchability: Is it possible to write algorithms that are guaranteed to find a solution of Equation 1?

  • Technically, we could just try a very large number of choices of \(\mathbf{w}\) and hope for the best, but that’s not very efficient (in fact, the cost of getting a reasonable answer this way grows exponentially in \(p\), the number of features). Can we do better?

Performance: Is it possible to make these algorithms fast?

  • This is where our real problem lies:

Theorem 1 (0-1 Minimization for Linear Classifiers is NP-Hard (Kearns, Schapire, and Sellie 1994)) Unless P = NP, there is no polynomial-time algorithm that can solve the 0-1 empirical risk minimization problem for linear classifiers.

So, if we are going to have reasonable algorithms for empirical risk minimization, we need to choose a different loss function. There are multiple choices. Before we jump into examples, we’re going to define the core mathematical concept that is going to help address our core questions of existence, uniqueness, searchability, and performance.

Convexity

Definition 1 A set \(S \subseteq \mathbb{R}^n\) is convex if, for any two points \(\mathbf{z}_1, \mathbf{z}_2 \in S\) and for any \(\lambda \in [0,1]\), the point \(\mathbf{z}= \lambda \mathbf{z}_1 + (1-\lambda) \mathbf{z}_2\) is also an element of \(S\).

Definition 2 (Convex Functions) Let \(S \subseteq \mathbb{R}^n\) be convex. A function \(f:S \rightarrow \mathbb{R}\) is convex if, for any \(\lambda \in [0,1]\) and any two points \(\mathbf{z}_1, \mathbf{z}_2 \in S\), we have

\[ f(\lambda \mathbf{z}_1 + (1-\lambda)\mathbf{z}_2) \leq \lambda f( \mathbf{z}_1 ) + (1-\lambda)f(\mathbf{z}_2)\;. \]

The function \(f\) is strictly convex if the inequality is strict: for all \(\lambda \in (0,1)\) and all \(\mathbf{z}_1 \neq \mathbf{z}_2\),

\[ f(\lambda \mathbf{z}_1 + (1-\lambda)\mathbf{z}_2) < \lambda f( \mathbf{z}_1 ) + (1-\lambda)f(\mathbf{z}_2)\;. \]

Roughly, a convex function is “bowl-shaped.”
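
One way to build intuition for Definition 2 is to check the inequality numerically. The sketch below (purely illustrative) tests the convexity inequality for the bowl-shaped function \(f(z) = z^2\) at randomly chosen points:

import numpy as np

f = lambda z: z**2  # a familiar bowl-shaped (in fact, strictly convex) function

rng = np.random.default_rng(1234)
for _ in range(5):
    z1, z2 = rng.uniform(-10, 10, size = 2)
    lam = rng.uniform(0, 1)
    lhs = f(lam*z1 + (1 - lam)*z2)      # f evaluated at the blend of z1 and z2
    rhs = lam*f(z1) + (1 - lam)*f(z2)   # the blend of f(z1) and f(z2)
    print(lhs <= rhs)                   # True every time for a convex f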

Definition 3 (Local and Global Minimizers) A point \(\mathbf{z}\in S\) is a global minimizer of the function \(f:S \rightarrow \mathbb{R}\) if \(f(\mathbf{z}) \leq f(\mathbf{z}')\) for all \(\mathbf{z}' \in S\).

A point \(\mathbf{z}\in S\) is a local minimizer of \(f:S \rightarrow \mathbb{R}\) if there exists a neighborhood \(T \subseteq S\) containing \(\mathbf{z}\) such that \(\mathbf{z}\) is a global minimizer of \(f\) on \(T\).

Theorem 2 Let \(f:S \rightarrow \mathbb{R}\) be a convex function. Then:

  1. If \(S\) is closed and bounded, \(f\) has a minimizer \(\mathbf{z}^*\) in \(S\).
  2. Furthermore, if \(\mathbf{z}^*\) is a local minimizer of \(f\), then it is also a global minimizer.
  3. If in addition \(f\) is strictly convex, then this minimizer is unique.

It’s ok if you don’t know what it means for a set to be closed – all the convex functions we will care about in this class will either be defined on sets where this theorem holds or will be otherwise defined so that the conclusions apply.

Proof. The proof of item 1 needs some tools from real analysis. The short version is:

  • Every convex function is continuous.
  • If \(S\subseteq \mathbb{R}^n\) is closed and bounded, then it is compact.
  • Continuous functions attain their minima and maxima on compact sets.

It’s ok if you didn’t follow this! Fortunately the second part of the proof is one we can do together. Suppose for contradiction that \(\mathbf{z}^*\) is a local minimizer of \(f\), but that there is also a point \(\mathbf{z}'\) such that \(f(\mathbf{z}') < f(\mathbf{z}^*)\). Since \(\mathbf{z}^*\) is a local minimizer, we can find some neighborhood \(T\) containing \(\mathbf{z}^*\) such that \(\mathbf{z}^*\) is a minimizer of \(f\) on \(T\). Let \(\lambda \in (0,1)\) be some very small number and consider the point \(\mathbf{z}= \lambda \mathbf{z}' + (1-\lambda)\mathbf{z}^*\). Specifically, choose \(\lambda\) small enough that \(\mathbf{z}\in T\) (this is possible because small \(\lambda\) makes \(\mathbf{z}\) close to \(\mathbf{z}^*\)). We can evaluate

\[ \begin{align} f(\mathbf{z}) &= f(\lambda \mathbf{z}' + (1-\lambda)\mathbf{z}^*) \tag{definition of $\mathbf{z}$}\\ &\leq \lambda f(\mathbf{z}') + (1-\lambda)f(\mathbf{z}^*) \tag{$f$ is convex} \\ &= f(\mathbf{z}^*) + \lambda (f(\mathbf{z}') - f(\mathbf{z}^*)) \tag{algebra}\\ &< f(\mathbf{z}^*)\;. \tag{assumption that $f(\mathbf{z}') < f(\mathbf{z}^*)$} \end{align} \]

But this is a contradiction, since we constructed \(\mathbf{z}\) to be in the neighborhood \(T\) where \(\mathbf{z}^*\) is a local minimizer. We conclude that there is no \(\mathbf{z}'\) such that \(f(\mathbf{z}') < f(\mathbf{z}^*)\), and therefore that \(\mathbf{z}^*\) is a global minimizer.

The proof of the third part follows a very similar argument to the proof of the second part!
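
For concreteness, here is a sketch of that argument. Suppose for contradiction that a strictly convex \(f\) had two distinct global minimizers \(\mathbf{z}_1 \neq \mathbf{z}_2\), with shared minimum value \(m = f(\mathbf{z}_1) = f(\mathbf{z}_2)\). Applying strict convexity with \(\lambda = \frac{1}{2}\),

\[ f\left(\tfrac{1}{2}\mathbf{z}_1 + \tfrac{1}{2}\mathbf{z}_2\right) < \tfrac{1}{2}f(\mathbf{z}_1) + \tfrac{1}{2}f(\mathbf{z}_2) = m\;, \]

which contradicts the assumption that \(m\) is the minimum value of \(f\). So a strictly convex function has at most one global minimizer.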

There are two other very important math facts that we need in order to apply convexity to the empirical risk minimization problem for linear models.

Theorem 3

  1. Let \(f_1\) and \(f_2\) be convex functions with the same domain, and let \(a\) and \(b\) be nonnegative real numbers. Then, the function \(f\) defined by \(f(\mathbf{z}) = af_1(\mathbf{z}) + bf_2(\mathbf{z})\) is also convex.
  2. Let \(f:\mathbb{R}^n\rightarrow \mathbb{R}\) be convex. Let \(\mathbf{A}\in \mathbb{R}^{n\times p}\) and \(\mathbf{b} \in \mathbb{R}^n\). Then, the function \(f_{\mathbf{A},\mathbf{b}}\) defined by \(f_{\mathbf{A},\mathbf{b}}(\mathbf{z}) = f(\mathbf{A}\mathbf{z}- \mathbf{b})\) is convex.

By induction, it follows from part 1 that any linear combination of convex functions with nonnegative coefficients is convex.
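
To see why part 1 holds, we can chain the convexity inequalities for \(f_1\) and \(f_2\): since \(a, b \geq 0\), multiplying by them preserves the direction of each inequality. For any \(\lambda \in [0,1]\),

\[ \begin{aligned} f(\lambda \mathbf{z}_1 + (1-\lambda)\mathbf{z}_2) &= af_1(\lambda \mathbf{z}_1 + (1-\lambda)\mathbf{z}_2) + bf_2(\lambda \mathbf{z}_1 + (1-\lambda)\mathbf{z}_2) \\ &\leq a\left[\lambda f_1(\mathbf{z}_1) + (1-\lambda)f_1(\mathbf{z}_2)\right] + b\left[\lambda f_2(\mathbf{z}_1) + (1-\lambda)f_2(\mathbf{z}_2)\right] \\ &= \lambda f(\mathbf{z}_1) + (1-\lambda)f(\mathbf{z}_2)\;. \end{aligned} \]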

Convexity and Empirical Risk Minimization

Let’s finally go back to the empirical risk minimization problem for linear models. We’re going to write it in terms of a general loss function \(\ell\); the choice \(\ell(\hat{y}, y) = 1 - \mathbb{1}\left[ \hat{y}y > 0 \right]\) gets us back to the 0-1 loss situation. The general empirical risk minimization problem for linear classifiers is

\[ \hat{\mathbf{w}} = \mathop{\mathrm{arg\,min}}_{\mathbf{w}} \frac{1}{n} \sum_{i = 1}^n \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i)\;. \tag{2}\]

Let’s now assume that the loss function \(\ell\) is strictly convex in its first argument: that is, for any possible value of \(y\), any \(\lambda \in (0,1)\), and any \(\hat{y}_1 \neq \hat{y}_2\), \[ \ell(\lambda \hat{y}_1 + (1-\lambda)\hat{y}_2, y) < \lambda \ell(\hat{y}_1, y) + (1-\lambda)\ell(\hat{y}_2, y)\;. \] Then, suddenly the following things would all also be true:

  1. \(\ell(\langle \mathbf{w}, \mathbf{x} \rangle, y)\) is strictly convex as a function of \(\mathbf{w}\) (Theorem 3, part 2).
  2. The empirical risk \(L(\mathbf{w}) = \frac{1}{n}\sum_{i = 1}^n \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i)\) is strictly convex as a function of \(\mathbf{w}\) (Theorem 3, part 1).
  3. If the empirical risk \(L(\mathbf{w})\) has a global minimizer, that global minimizer is unique (Theorem 2, part 3).
  4. The empirical risk \(L(\mathbf{w})\) has no local minimizers which are not global minimizers (Theorem 2, part 2).

These facts have important implications for our fundamental questions on empirical risk minimization.

Existence: Even convex functions are not guaranteed to have minimizers: for example, \(f(z) = e^{z}\) is convex but approaches its infimum of 0 without ever attaining it. However, there are lots of choices of loss function \(\ell\) which do guarantee that the empirical risk has a minimizer.

Uniqueness: When the empirical risk is strictly convex, there can only be one global minimizer.

Searchability: When the empirical risk is strictly convex, there are also no local minimizers other than the global minimizer. Algorithmically, this is the most important property of convexity. It means that if I manage to find any local minimizer at all, that point must be the global minimizer.

Performance: Convexity significantly reduces the difficulty of our task: instead of trying to find “the best” solution, it’s sufficient for us to find any local optimum. This means that we can design our algorithms to be “greedy local minimizer hunters.” There are lots of fast algorithms to do this. An especially important class of algorithms are gradient descent methods, which we’ll discuss soon.

If you’ve taken an algorithms class, one way of thinking of convexity is that it guarantees that greedy methods work for solving minimization problems.
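
To preview what such a greedy algorithm can look like, here is a minimal sketch (not the implementation we’ll develop later) of gradient descent on the strictly convex function \(f(w) = (w-3)^2\); the learning rate of 0.1 is an illustrative choice:

f_grad = lambda w: 2*(w - 3)   # derivative of f(w) = (w - 3)^2

w = 0.0                        # arbitrary starting point
alpha = 0.1                    # learning rate: how far to step each time
for _ in range(100):
    w = w - alpha*f_grad(w)    # greedy step in the downhill direction

w                              # very close to the unique global minimizer w* = 3

Because \(f\) is strictly convex, this purely local procedure cannot get stuck: the only point where the gradient vanishes is the global minimizer.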

Demo: Logistic Regression

Let’s do a partial implementation of logistic regression to illustrate these techniques. In logistic regression, we assume that \(y \in \{0,1\}\). Our loss function is the logistic loss:

\[ \ell(\hat{y}, y) = -y \log \sigma(\hat{y}) - (1-y)\log (1-\sigma(\hat{y}))\;, \]

where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the logistic sigmoid function.
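
As a quick numerical illustration (not part of the implementation below), note that \(\sigma\) squashes any real score into the interval \((0,1)\), with \(\sigma(0) = 1/2\):

import numpy as np

sigma = lambda z: 1/(1 + np.exp(-z))
print(sigma(0))             # exactly 0.5: a score of 0 sits on the decision boundary
print(sigma(5), sigma(-5))  # approximately 0.993 and 0.007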

The logistic loss is strictly convex in \(\hat{y}\), although proving this requires a little bit of extra math that we won’t discuss. Here’s a “proof by picture” for the case when the label is \(y = 1\), in which the loss reduces to \(\ell(\hat{y}, 1) = -\log \sigma(\hat{y})\).

Code
z = np.linspace(-5, 5, 101)
plt.plot(z, -np.log(1/(1 + np.exp(-z)))) 
labs = plt.gca().set(xlabel = r"$\hat{y}$", ylabel = r"$-\log \sigma(\hat{y})$")

Because the logistic loss is strictly convex in \(\hat{y}\), the empirical risk minimization problem has at most one global minimizer. In fact, it’s possible to show that, if the data is not linearly separable, a global minimizer exists.

Here is some sample data for which we will try to find a good linear classifier.

Code
from sklearn.datasets import make_blobs

p_features = 3

X, y = make_blobs(n_samples = 100, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])

fig = plt.scatter(X[:,0], X[:,1], c = y)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

Note that this data is not linearly separable. The perceptron algorithm wouldn’t even have converged for this data set, but logistic regression will do great.

Now let’s do an implementation. First let’s define a linear predictor function of the form \(f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle\). Note that this predictor makes the predictions on all the training data at once!

import numpy as np 
from scipy.optimize import minimize

# intermediate steps in logistic regression tend to trigger log(0) and overflow
# warnings that wash out in the end, so we suppress them
np.seterr(all='ignore') 

# add a constant feature to the feature matrix
X_ = np.append(X, np.ones((X.shape[0], 1)), 1)

def predict(X, w):
    return X@w

Now we’ll define some functions to compute the empirical risk:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# returns a vector containing the per-observation logistic loss for each observation
def logistic_loss(y_hat, y): 
    return -y * np.log(sigmoid(y_hat)) - (1 - y)*np.log(1 - sigmoid(y_hat))

# first compute the predictions, then compute the average loss per observation
# note that this works on the ENTIRE DATA SET AT ONCE: no for-loops
def empirical_risk(X, y, w, loss):
    y_hat = predict(X, w)
    return loss(y_hat, y).mean()
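
As a quick sanity check (purely illustrative, using the objects defined above), we can evaluate the empirical risk at a random parameter vector; whatever minimizer we find below should achieve a much lower value:

w_rand = np.random.rand(p_features)           # a random (probably bad) candidate
empirical_risk(X_, y, w_rand, logistic_loss)  # its average logistic loss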

Finally, we can write the function that will solve the empirical risk minimization problem for us. We’re going to use the scipy.optimize.minimize function, an off-the-shelf routine from scipy for solving minimization problems. Soon, we’ll study how to solve minimization problems from scratch.

The scipy.optimize.minimize function requires us to pass it a single function that accepts a vector of parameters, plus an initial guess for the parameters.

def find_pars(X, y):
    
    p = X.shape[1]
    w0 = np.random.rand(p) # random initial guess
    
    # perform the minimization
    result = minimize(lambda w: empirical_risk(X, y, w, logistic_loss), 
                      x0 = w0) 
    
    # return the parameters
    return result.x

Ok, let’s try it and take a look at the parameters we obtained. Because the final column of X_ is the constant column of 1s, the final entry of w plays the role of \(-b\): it is the negative of the intercept term \(b\) from our hyperplane equation.

w = find_pars(X_, y)
w
array([2.33621929, 1.83628124, 0.37074447])
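
Before plotting, we might also check the training accuracy of the learned classifier (a hypothetical snippet, not part of the original demo). Since the labels here satisfy \(y \in \{0, 1\}\), we predict label 1 exactly when \(\hat{y} = \langle \mathbf{w}, \mathbf{x} \rangle > 0\):

y_hat = predict(X_, w)       # scores for every observation at once
(1*(y_hat > 0) == y).mean()  # fraction of training points classified correctly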

And, finally, we can plot the linear classifier that we learned.

fig = plt.scatter(X[:,0], X[:,1], c = y)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

f1 = np.linspace(-3, 3, 101)

plt.plot(f1, -(w[2] + f1*w[0])/w[1], color = "black")

Since the logistic loss is strictly convex, we are guaranteed that this solution is the unique best solution (as measured by the logistic loss). There is no other possible set of parameters that would lead to a better result (again, as measured by the logistic loss).



© Phil Chodrow, 2023

References

Kearns, Michael J, Robert E Schapire, and Linda M Sellie. 1994. “Toward Efficient Agnostic Learning.” Machine Learning 17: 115–41.