Warmup Exercises

Author: Phil Chodrow

Perceptron

Part 1

Sketch the line in \(\R^2\) described by the equation \[ \bracket{\vw, \vx} = b\;, \tag{1}\]

where \(\vw = \paren{1, -\frac{1}{2}}^T \in \R^2\) and \(b = \frac{1}{2}\). Here, \(\bracket{\vw, \vx} = \sum_{i = 1}^n w_i x_i\) is the inner product (or dot product) between the vectors \(\vw\) and \(\vx\).

Part 2

Write a quick Python function called perceptron_classify(w, b, x). w and x should both be 1d numpy arrays of the same length, and b should be a scalar. Your function should return 0 if \(\bracket{\vw, \vx} < b\) and 1 if \(\bracket{\vw, \vx} \geq b\).

An excellent solution will use neither a for-loop nor an if-statement.

Verify that your function works on a few simple examples.
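
If you get stuck, one possible sketch looks like the following (there are several equally good ways to write this; the quick checks just reuse \(\vw\) and \(b\) from Part 1):

import numpy as np

def perceptron_classify(w, b, x):
    # multiplying a boolean by 1 converts it to 0 or 1, so no if-statement is needed
    return 1*(np.dot(w, x) >= b)

# a couple of quick checks with w and b from Part 1
w = np.array([1, -0.5])
b = 0.5
print(perceptron_classify(w, b, np.array([2.0, 0.0])))  # <w, x> = 2 >= 0.5, so 1
print(perceptron_classify(w, b, np.array([0.0, 0.0])))  # <w, x> = 0 <  0.5, so 0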

Part 3

Consider an equation of the general form of Equation 1. Let’s allow \(\vx\) and \(\vw\) to both be \(n\)-dimensional, so that this equation defines a hyperplane in \(\R^n\). Suppose that we wanted to represent the same hyperplane in \(\R^n\) using an equation of the form

\[ \bracket{\tilde{\vw}, \tilde{\vx}} = 0\;. \tag{2}\]

for some \(\tilde{\vw} \in \R^{n+1}\). Define \(\tilde{\vx} = (\vx, 1)\). How could you define \(\tilde{\vw}\) to make Equation 2 equivalent to Equation 1?
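
If you’d like to check a candidate \(\tilde{\vw}\) numerically, a quick sketch like the following compares the two sides of Equation 1 and Equation 2 at random points, using the \(\vw\) and \(b\) from Part 1 (the candidate passed below is deliberately wrong, so the check returns False):

import numpy as np

w = np.array([1, -0.5])
b = 0.5

def check_candidate(w_tilde, n_trials = 5):
    # does <w, x> - b match <w_tilde, (x, 1)> at a few random points?
    for _ in range(n_trials):
        x = np.random.rand(2)
        x_tilde = np.append(x, 1)
        if not np.isclose(np.dot(w, x) - b, np.dot(w_tilde, x_tilde)):
            return False
    return True

print(check_candidate(np.array([1.0, -0.5, 0.0])))  # a deliberately wrong candidate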

Convexity

As you learned in Daumé, informally, a convex function is a function that is “bowl-shaped.” Hardt and Recht give the formal definition, which has the benefit of applying to functions of many variables.

Part 1

Consider the 0-1 step function that I’ve plotted below:

from matplotlib import pyplot as plt 
import numpy as np

fig, ax = plt.subplots(1, 1) 
y_hat = np.linspace(-1, 1, 101)

loss = lambda y_hat, y: 1 - 1*(y_hat*y > 0)

ax.set(xlabel = r"$\hat{y}$", 
       ylabel = r"$\ell(\hat{y}, y)$")

ax.plot(y_hat, loss(y_hat, 1))

Show pictorially that this function is not convex. No proof needed – just the right drawing.

Part 2: Second Derivative Test

Another way to tell whether a function is convex is to check its second derivative. If a function \(f:S\rightarrow \R\) has a convex domain \(S\subseteq \R\), if \(f\) is everywhere twice-differentiable, and if \(\frac{d^2f(z_0)}{dz^2} > 0\) for all \(z_0 \in S\), then \(f\) is convex.

Use the second derivative test to check that the following two functions are convex:

The base of the logarithm doesn’t really matter, but for this course it is always most convenient to assume logs base \(e\), which you might also have seen written \(\ln\).

\[ \begin{aligned} f(z) &= - \log z \\ g(z) &= - \log(1-z)\;. \end{aligned} \]
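
As a template (using a function that is not part of the exercise), here is the second derivative test applied to \(h(z) = e^z\): since \(h''(z) = e^z > 0\) for every \(z \in \R\), \(h\) is convex on all of \(\R\). Your arguments for \(f\) and \(g\) can follow the same pattern; just be careful about the domains on which \(f\) and \(g\) are defined.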

Part 3: Plotting Practice

In a Jupyter notebook, write a simple program to plot each of the functions \(f\) and \(g\) from Part 2. Some of the Part 1 code is likely to help you.
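
A minimal sketch (assuming you keep the matplotlib conventions from Part 1) might look like this; note that \(f\) is only defined for \(z > 0\) and \(g\) only for \(z < 1\), so the grid below stays inside both domains:

import numpy as np
from matplotlib import pyplot as plt

z = np.linspace(0.01, 0.99, 101)   # stay inside the domains of both functions

fig, ax = plt.subplots(1, 1)
ax.plot(z, -np.log(z), label = r"$f(z) = -\log z$")
ax.plot(z, -np.log(1 - z), label = r"$g(z) = -\log(1-z)$")
ax.set(xlabel = r"$z$")
ax.legend()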

Part 4: Convexity in Many Variables

Recall the Hardt and Recht definition of convexity: a function \(f:\R^p \rightarrow \R\) is convex if, for any \(\lambda \in [0,1]\) and any points \(\vz_1, \vz_2 \in \R^p\),

\[ f(\lambda \vz_1 + (1-\lambda)\vz_2) \leq \lambda f(\vz_1) + (1-\lambda)f(\vz_2)\;. \]

Using this definition, write a short mathematical proof that the function \(f(\vz) = \norm{\vz} = \sqrt{\bracket{\vz, \vz}}\) is convex. You will want to use the triangle inequality, which says that \(\norm{\vz_1 + \vz_2} \leq \norm{\vz_1} + \norm{\vz_2}\), along with the absolute homogeneity property \(\norm{\alpha \vz} = \abs{\alpha}\norm{\vz}\).

This proof requires just a few lines if you carefully use your definitions!
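
This is a pencil-and-paper proof, but if you’d like a numerical sanity check before writing it, a sketch like the following tests the defining inequality at randomly drawn points (it can’t prove convexity; it can only fail to find a counterexample):

import numpy as np

rng = np.random.default_rng()
norm = lambda z: np.sqrt(np.dot(z, z))   # the Euclidean norm

for _ in range(1000):
    z1, z2 = rng.normal(size = 5), rng.normal(size = 5)
    lam = rng.uniform()
    assert norm(lam*z1 + (1-lam)*z2) <= lam*norm(z1) + (1-lam)*norm(z2) + 1e-12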

Gradient Descent

Consider the quadratic function \(g(z) = \frac{1}{2}az^2 + bz + c\).

  1. Prove that \(g\) has a critical point at \(z^* = -\frac{b}{a}\) (hint: solve \(g'(z^*) = 0\)).
  2. What must be true about the constants \(a\), \(b\), and \(c\) to ensure that this point is a local minimum of \(g\)? (Hint: second derivative test).
  3. Suppose now that we are able to evaluate the function \(g\), as well as its derivative \(g'\), but not able to use algebra to find \(z^*\) (this mirrors our situation in most practical problems). Instead, we are going to use the following algorithm to attempt to approximate \(z^*\):
    • Begin with some initial guess \(z^{(0)}\).
    • In each time-step \(t\), compute \(z^{(t+1)} \gets z^{(t)} - \alpha g'(z^{(t)})\), where \(\alpha > 0\) is the learning rate.
    • In practice we would need to specify a stopping criterion, but for this theoretical problem we don’t need to worry about it.
  4. Using algebra, prove that for any timestep \(t\), \[ (z^* - z^{(t+1)})^2 = (a\alpha - 1)^2(z^* - z^{(t)})^2\;. \]
  5. Let’s think of \(\abs{z^* - z^{(t)}}\) as the error in our current estimate \(z^{(t)}\). Using the recurrence above, conclude that, for any \(t\), the error \(\abs{z^* - z^{(t)}}\) satisfies \[ \abs{z^* - z^{(t)}} = \abs{a\alpha - 1}^{t}\abs{z^* - z^{(0)}}\;. \]
  6. For \(\alpha \in (\alpha_*, \alpha^*)\), we are guaranteed that the error \(\abs{z^* - z^{(t)}}\rightarrow 0\) as \(t\rightarrow \infty\). What are \(\alpha_*\) and \(\alpha^*\)?
  7. Suppose that \(\alpha\) is within the necessary range. I want to guarantee that \(\abs{z^* - z^{(t)}} < \epsilon\) for some small \(\epsilon > 0\) (in practice we often call this the tolerance). Conclude that the number of steps necessary to reach this tolerance is no greater than \[ \bar{t} = \frac{ \log \epsilon - \log \abs{z^* - z^{(0)}}}{\log \abs{a\alpha - 1}}\;. \]

Ignoring constants with respect to \(\epsilon\), we say that this algorithm for finding the minimum of \(g\) with tolerance \(\epsilon\) has a \(\log \epsilon\) convergence rate.
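
Here is a small numerical sketch of the iteration from step 3, with made-up values of \(a\), \(b\), \(c\), and \(\alpha\); it isn’t part of any proof, but it lets you watch the error shrink by the factor \(\abs{a\alpha - 1}\) at each step:

a, b, c = 2.0, -3.0, 1.0     # made-up coefficients with a > 0 (c doesn't affect the gradient)
alpha = 0.1                  # made-up learning rate
z_star = -b/a                # the critical point from step 1

z = 10.0                     # an arbitrary initial guess z^(0)
for t in range(5):
    print(abs(z_star - z))
    z = z - alpha*(a*z + b)  # gradient step, since g'(z) = az + b

Each printed error should be \(\abs{a\alpha - 1} = 0.8\) times the previous one.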

Gradient Descent (Again)

Consider the function \(f(w_0, w_1) = \sin(w_0w_1)\). You can define this function like this:

import numpy as np
def f(w):
    return np.sin(w[0]*w[1])

Mathematically, the gradient of this function is

\[\nabla f(w_0, w_1) = (w_1\cos(w_0w_1), w_0 \cos(w_0w_1))^T.\]

  1. Implement a simple loop that uses gradient descent to find a minimum of this function.
    • You’ll have to choose the learning rate \(\alpha\).
    • The np.cos() function will be useful for programming the gradient.
    • It’s not the fastest approach, but if you’re not sure how to program the gradient, you can always first implement it as a list of two floats and then use np.array(my_list) to convert it into a numpy array. (A minimal sketch of such a loop appears after this list.)
    • You’ll also need to pick a random starting guess.
  2. Find two initial guesses for the parameter vector \(\vw\) such that you get two different final minimizers (this is possible because \(f\) is not convex).
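
A minimal sketch of such a loop, with made-up choices of learning rate, number of steps, and starting interval, might look like this:

import numpy as np

def f(w):
    return np.sin(w[0]*w[1])

def grad_f(w):
    # gradient from the formula above
    return np.array([w[1]*np.cos(w[0]*w[1]), w[0]*np.cos(w[0]*w[1])])

alpha = 0.1                               # made-up learning rate
w = np.random.uniform(-2, 2, size = 2)    # random starting guess

for _ in range(1000):
    w = w - alpha*grad_f(w)

print(w, f(w))    # at a (local) minimum, f(w) should be close to -1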

Overfitting and the Scientific Method


In the scientific method, it is often emphasized that we need to formulate a hypothesis before performing an experiment. It’s fine for the hypothesis to be based on previous experiments. However, the scientific method never allows us to perform an experiment, formulate a hypothesis, and then say that the experiment supported the (new) hypothesis.

We can think of scientific theories as systems of thought that help us make predictions about new phenomena. With this in mind, please write a short paragraph explaining the importance of hypothesis-first science using the language of machine learning. In your explanation, please use the following vocabulary:

  • Training data.
  • Training accuracy.
  • Validation/testing data.
  • Validation/testing accuracy.
  • Overfitting.

The Coin-Flipping Game

Let’s play a game! Here is the setup:

I have a coin with probability of heads equal to \(p \in [0,1]\). I am going to ask you to pick a number \(\hat{p} \in [0,1]\). Then, I flip my coin.

This game is more fun for me than it is for you.
  • If my coin comes up heads, you give me \(-\log \hat{p}\) dollars.
  • If my coin comes up tails, you give me \(-\log (1-\hat{p})\) dollars.

Part 1

Compute the expected amount of money you will give me when we play this game in terms of \(p\) and \(\hat{p}\). Call this quantity \(R(\hat{p}, p)\). This is the risk of the guess \(\hat{p}\).

Part 2

Suppose I tell you the value of \(p\). Write a mathematical proof to show that your best choice of \(\hat{p}\) (the one that loses you the least money) is \(\hat{p} = p\).

Hint: take the derivative and set it equal to 0! Don’t forget to check that you’ve found a minimum of \(R(\hat{p}, p)\) rather than a maximum or an inflection point.

Part 3

Now suppose that I don’t tell you the true value of \(p\). Instead, I let you observe \(n\) coin flips before asking you to make your guess. Describe:

  • A suggestion for choosing \(\hat{p}\) based only on the results of the previous flips.
  • A way to estimate the risk (expected amount of money lost) based only on the results of the previous flips.

Your answer should depend on \(\hat{p}\) but not on \(p\)!

Balancing Classification Rates

You can do this first part just by copying and pasting lecture code. It doesn’t matter much how good your model is – just make sure you’re able to get predictions.

Use the code from our recent lecture to download the Titanic data set as a Pandas data frame and train a model on the training data. Then download the test data. Compute y_pred, the vector of predictions of your model on the test data.

Then, write a function that verifies eq. (2.6) in Alexandra Chouldechova’s paper “Fair Prediction with Disparate Impact.” Recall that the positive predictive value is \(\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}\). Here’s what your function should do:

  1. Given vectors y_pred of predictions and y_test of actual labels, compute the False Negative Rate (FNR), False Positive Rate (FPR), prevalence \(p\), and positive predictive value (PPV). (A sketch of these rate computations appears after this list.)
  2. Return as a tuple the lefthand side and righthand side of eq. (2.6) in Chouldechova.
  3. Verify that the two numbers are equal!
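
Here is a hedged sketch of the rate computations from step 1, assuming y_pred and y_test are binary numpy arrays (or pandas Series); assembling the two sides of eq. (2.6) from these quantities is left to you:

import numpy as np

def rates(y_pred, y_test):
    y_pred, y_test = np.array(y_pred), np.array(y_test)
    TP = np.sum((y_pred == 1) & (y_test == 1))
    FP = np.sum((y_pred == 1) & (y_test == 0))
    TN = np.sum((y_pred == 0) & (y_test == 0))
    FN = np.sum((y_pred == 0) & (y_test == 1))

    FNR = FN / (FN + TP)   # false negative rate
    FPR = FP / (FP + TN)   # false positive rate
    p   = np.mean(y_test)  # prevalence
    PPV = TP / (TP + FP)   # positive predictive value
    return FNR, FPR, p, PPV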

Limits of The Quantitative Approach to Discrimination

I’ll give you each a number in Slack. The numbers correspond to the following sections of Narayanan (2022):

  1. The null hypothesis allocates the burden of proof (p. 7-8)
  2. Compounding inequality is far below the radar of quantitative methods (p. 9-10)
  3. Snapshot datasets hide discrimination (p. 10-11)
  4. Explaining away discrimination (p. 12-13)
  5. What counts as evidence is a subjective choice (p. 5-7)

For your assigned section, please write a short paragraph (4-5 simple sentences is fine). You should:

  • Summarize Narayanan’s key points in that section.
  • In one of the sentences, describe which aspects of the Uber case study (p. 13-16) reflect the ideas of the section you described.

Bring your paragraph in class and be ready to read it to your group.

Vectorization Brainstorm

In a recent lecture, we discussed methods of vectorizing text like the document-term matrix that use the bag of words assumption: the order of the words doesn’t matter!

Take some time and propose an alternative approach to word-based text vectorization. Can you find a scheme that would give different vector representations to the following two sentences?

“I love rabbits, not cats.” “I love cats, not rabbits.”

You don’t have to implement your vectorization, but please be prepared to write pseudocode for your group to show in detail how you would perform the vectorization.

Image Compression Factor of K-Means

In today’s reading on K-means clustering from the Python Data Science Handbook, Jake VanderPlas considers the use of K-means to reduce the number of distinct colors in an image (Example 2). I encourage you to run the code for this example while thinking about this warmup!

Give an estimate of the compression factor: the reduction of information achieved when compressing an image using k-means clustering into \(k\) color clusters. The compression factor is the number of bits required to store the compressed image, divided by the number of bits required to store the original image. Both of these numbers can be computed asymptotically (i.e. with big-oh reasoning) in order to simplify the analysis.

There are multiple good ways to think about this question, and you’re welcome to choose one that makes sense to you as long as you carefully state your steps and assumptions. Here are a few points that I find helpful:

Bits in Original Image

  1. An image with \(n\) rows and \(m\) columns has \(nm\) pixels.
  2. Each pixel has three color channels: Red, Green, and Blue (RGB).
  3. Each color channel can be represented with 8 bits (which encode an integer between 0 and 255, denoting the color intensity in that channel).
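
For concreteness, a hypothetical image with \(400\) rows and \(600\) columns would therefore require \(400 \times 600 \times 3 \times 8 = 5{,}760{,}000\) bits under this scheme.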

Bits in Compressed Image

  1. If I compress an image into just \(k\) distinct colors, then instead of storing the full RGB value for each pixel, I can just store enough bits to uniquely identify the cluster containing each pixel. How many bits do I need for this?
  2. I also need to store a dictionary (hash map) that associates color \(j\) (i.e. the centroid of the \(j\)th cluster of colors) to its RGB value.

Optional Extra

Try running the code above while varying the number of clusters. Do you think that a 16-color compression looks much better than an 8-color compression? Do you think the difference is good enough to justify the additional storage? What about 32 colors vs. 16?

Introducing Tensors

First, install PyTorch 2.0 into your ml-0451 Anaconda environment.

The best way to install PyTorch is probably to run the following at the command line:

conda activate ml-0451
pip3 install torch torchvision torchaudio

Then, in a blank Jupyter notebook, copy, paste, and run each of the code blocks in the first section of the PyTorch tutorial.
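
If you just want a quick sanity check that the installation works (this snippet isn’t from the tutorial), something like the following should run without errors:

import numpy as np
import torch

a = np.ones((2, 3))
t = torch.from_numpy(a)     # a tensor that shares memory with the numpy array
print(t)
print(t.numpy())            # and back to numpy again
print(torch.ones(2, 3, requires_grad = True))   # tensors can track gradients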

You may need to do a little bit of exploring around the tutorials in order to come up with answers to these questions.

Finally, write down a single-sentence answer to each of the following questions:

  1. In what ways is a PyTorch tensor similar to a Numpy array?
  2. In what ways is a PyTorch tensor different from a Numpy array?
  3. What is the primary motivation for the use of a specialized tensor data type, rather than an array, for deep learning?

Efficient Differentiation

This exercise is based on a section of Chinmay Hegde’s notes on stochastic gradient descent and neural networks.

Consider the following function: \[ L(w, b) = \frac{1}{2} \left(y - \sigma(wx + b)\right)^2 + \frac{1}{2}\lambda w^2\;, \]

where \(\sigma(a) = \frac{1}{1 + e^{-a}}\).

This is the loss function that would be obtained when using a single feature \(x\) to predict \(y\), using the function \(\sigma(wx + b)\) as the predictor, measuring the quality of this predictor with the square-error loss function, and adding a “ridge” regularization term \(\frac{1}{2}\lambda w^2\) that encourages the weight \(w\) to be small. Our aim is to compute the gradient of \(L\) with respect to \(w\) and \(b\).

The gradient of \(L\) is \(\nabla L (w, b) = \left(\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right)\), where

\[ \begin{aligned} \frac{\partial L}{\partial w} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))x + \lambda w \\ \frac{\partial L}{\partial b} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))\;. \end{aligned} \tag{3}\]

Here I’ve used the property of the sigmoid that \(\sigma'(a) = \sigma(a)(1-\sigma(a))\).

What You Should Do

Assume that each of the following operations cost one computational unit:

  • Multiplying or dividing two scalar numbers.
  • Adding or subtracting two scalar numbers.
  • Computing an exponential like \(e^a\).

Using this assumption:

  1. Determine the number of computational units (i.e. computational cost) of computing the gradient of \(L\) exactly as written in Equation 3, under the assumption that you are not allowed to store the values of any intermediate computations.
  2. Now determine the computational cost of computing the gradient of \(L\) under the assumption that you are allowed to store intermediate computations. Please describe both the number of computations and the number of floating point numbers that must be stored. (A short sketch of this idea in code appears below.)
  3. Finally, determine the computational cost, in terms of both steps and storage, of computing the gradient of \(L\) using the backpropagation algorithm (described for a very similar function in Hegde’s notes).

Compare your results from each method.
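
As a hint for item 2, here is a short sketch (with made-up input values) of how storing the shared intermediate quantities once lets both partial derivatives in Equation 3 reuse them; counting the operations is still up to you:

import numpy as np

x, y, w, b, lam = 1.5, 0.0, 0.3, -0.2, 0.1   # made-up values

# shared intermediate quantities, each computed exactly once
a    = w*x + b
sig  = 1 / (1 + np.exp(-a))            # sigma(wx + b)
core = (sig - y) * sig * (1 - sig)     # common factor in both partials

dL_dw = core*x + lam*w
dL_db = core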

Convolutional Kernels

Implement kernel convolution. Your implementation should accept a 2d array X (think of X as representing a greyscale image) and a square convolutional kernel K. Your implementation should operate using pure numpy. You can use any zero-padding strategy, but you do need to explain what your strategy is when presenting.

It’s ok to use a for-loop to loop over pixels.
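
If you want a starting point, here is a minimal sketch that assumes a square kernel with odd side length and uses “same” zero-padding, so the output has the same shape as the input (note that scipy.signal.convolve2d defaults to mode = "full", so its output is slightly larger):

import numpy as np

def convolve2d_numpy(X, K):
    m = K.shape[0]                 # kernel side length (assumed odd)
    pad = m // 2
    Kf = np.flip(K)                # flip the kernel to match true convolution
    Xp = np.pad(X, pad)            # zero-padding on all four sides
    out = np.zeros(X.shape)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(Xp[i:i+m, j:j+m] * Kf)
    return out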

Here’s an example image you can use:

from PIL import Image
import urllib
import numpy as np
from matplotlib import pyplot as plt

def read_image(url):
    return np.array(Image.open(urllib.request.urlopen(url)))

url = "https://i.pinimg.com/originals/0e/d0/23/0ed023847cad0d652d6371c3e53d1482.png"

img = read_image(url)

def to_greyscale(im):
    return 1 - np.dot(im[...,:3], [0.2989, 0.5870, 0.1140])

img = to_greyscale(img)

plt.imshow(img, cmap = "Greys")
plt.gca().axis("off")

After implementing your function, you should be able to use it like this, replacing the scipy.signal implementation with your own. The result should look something like this:

from scipy.signal import convolve2d

kernel = np.array([[-1, -1, -1], 
                   [-1,  8, -1], 
                   [-1, -1, -1]])

convd = convolve2d(img, kernel)

plt.imshow(convd, cmap = "Greys", vmin = 0, vmax = 8)
plt.gca().axis("off")

Project Check-In

This is a warmup activity that we will repeat weekly until the end of the semester.

Prepare a 2-3 minute “presentation” of your project to your group. Your presentation can be informal and does not need to have any special visual aids. The primary expectation is that you are able to demonstrate some relevant functionality to your peers. If your project involves coding or data analysis, your relevant functionality might be as simple as accessing or preparing the data. You should plan to demonstrate additional functionality each week.

In other words, you should show your group something that works, regardless of how “big” it is.

If you are doing a project that does not involve implementation (such as a research essay), then you are still expected to offer an update. Your contributions could include describing the sources you’ve found or showing your group an outline of the argument that you will make.

It’s appropriate for each member of your project group to give the same presentation during warmup. Please note that you may not be in the same warmup group as your project partners. This means that:

  • The code you show needs to run on your laptop or in your compute instance (e.g. Google Colab).
  • You need to be ready to explain what is being shown, even if your project partners did much of the work.

What Needs To Be Learned?

Suppose that you wanted to teach an individual to recognize English-language phishing emails. Write down a few features (based on the linked website, your own experience, or other sources) that you think would help someone classify an email as “phishing attempt or not” based on the text of the email.

Now, imagine that you are going to sit down with your tutee to teach them how to recognize English-language phishing emails. Where would you start your instruction if…

  1. Your tutee was another member of your warmup group.
  2. Your tutee was a fluent English speaker but had never used email.
  3. Your tutee was a regular email user but spoke no English.
  4. Your tutee spoke no English and had never seen a computer.

Which of these four scenarios would require the most “learning effort?” Which would require the least?

Word Embedding

Take out a sheet of paper and a pencil. Your goal is to place the following words on the sheet of paper in such a way that their location on the sheet is indicative of their relationships to each other. You can decide exactly how to do this. Should words with similar meanings be in the same part of the page? Should pairs of words with similar relationships have similar distances? Your approach is up to you, but please write it down along with your placements. Your words are:

  • Woman
  • Student
  • Nurse
  • Doctor
  • Man
  • Professor
  • Model
  • Computer
  • Machine
  • Programmer

Realistic Text?

In our reading on The Unreasonable Effectiveness of Recurrent Neural Networks, there are a few examples of model output that is realistic but not real. Pick one of the examples, and write down as carefully as you are able what makes the generated text realistic. Then, describe what “tells” would tip off an attentive observer that the text isn’t real (i.e. wasn’t actually written intentionally by a human) after all.

The Shakespeare and Wikipedia examples might be the easiest ones to think about, but feel free to look at the \(\LaTeX\) or Linux source code examples if you prefer.

Mind Map

Wow, we’ve covered a lot of ground in this class! Use a graphics program or a pen/paper to make a mind map describing some of the main theoretical and practical concepts we’ve covered. As a reminder, a mind-map is a graph in which the nodes are concepts and edges join related concepts. Please incorporate the following concepts as nodes:

  • Loss function
  • Target
  • Predictor
  • Model
  • Regression
  • Classification
  • Empirical risk minimization
  • Gradient descent
  • Feature map
  • Vectorization
  • Overfitting
  • Training data
  • Validation/testing data
  • Perceptron
  • Logistic regression
  • Neural networks
  • Linear regression

Additionally, please incorporate at least three other concepts of your own choosing.

Flowcharts in Quarto

Since mind-maps can be a little complicated to organize, you might find it easiest to work with some software. One optional possibility is actually included with Quarto: the Mermaid chart tool can render attractive diagrams that include labeled nodes and directed edges. Using a flowchart is probably the easiest way to go. For example, inserting the following code into a special {mermaid} code block will produce a diagram with four linked concept nodes:

flowchart TB
    A(First concept) --> B(Second concept)
    B --is part<br> of--> A
    B-->C(Third concept)
    A-->C
    A & B & C -->D(Fourth concept)

For the basics of using Mermaid with Quarto, see the Quarto docs. A benefit of this approach is that you don’t have to worry too much about positioning, and you can publish your mind map easily on your blog! On the other hand, this approach doesn’t give you much flexibility and makes it harder to be creative about incorporating complex relationships. Doing your mind map by hand or in any other software is entirely fine.

Classification Rates

Part 1

COVID-19 rapid tests have approximately an 80% sensitivity rate, which means that, in an individual who truly has COVID-19, the probability of a rapid test giving a positive result is roughly 80%. On the other hand, the probability of a rapid test giving a positive result for an individual who truly does not have COVID-19 is 5%. Suppose that approximately 4% of the population are currently infected with COVID-19.

These numbers are mostly made-up. Example 2.3.1 of Murphy (page 46) has a good review of the relevant probability and the definitions of each of the rates below.

Write a Python function called rate_summary that prints the following output, filling in the correct values for each of the specified rates:

s = 0.8            # test sensitivity
f = 0.05           # probability of positive test if no COVID
prevalence = 0.04  # fraction of population infected

rate_summary(s, f, prevalence)
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___. 
The false negative rate is ___. 

Part 2

  1. Suppose that scientists found an alternative rapid test which had a 75% sensitivity rate with a 0% chance of a positive test on someone who is truly not infected. Would you suggest replacing the old rapid tests with these alternative tests? Why?
  2. What if the alternative test had an 85% sensitivity rate and a 10% chance of a positive test on someone who is truly not infected?
You don’t necessarily need to use your function from the previous part in this part.

Part 3

It’s all well and good to do the math, but what about when we actually have data? Write a function called rate_summary_2 that accepts two columns of a pandas.DataFrame (or equivalently two one-dimensional numpy.arrays of equal length). Call these y and y_pred. Assume that both y and y_pred are binary arrays (i.e. arrays of 0s and 1s). y represents the true outcome, whereas y_pred represents the prediction from an algorithm or test. Here’s an example of the kind of data we are thinking about:

import pandas as pd

url = "https://github.com/middlebury-csci-0451/CSCI-0451/raw/main/data/toy-classification-data.csv"
df = pd.read_csv(url)

df.head() # just for visualizing the first few rows
   y  y_pred
0  0       0
1  1       0
2  1       0
3  0       1
4  0       0

You should be able to use your function like this:

# y is the true label, y_pred is the prediction
rate_summary_2(df["y"], df["y_pred"]) 
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___. 
The false negative rate is ___. 
Hints

An excellent solution for this part will not use any for-loops. Computing each of the four rates can be performed in a single compact line of code. To begin thinking of how you might do this, you may want to experiment with code like the following:

df[["y"]] == df[["y_pred"]]
df[["y"]].sum(), df[["y"]].sum()



© Phil Chodrow, 2023

References

Narayanan, Arvind. 2022. “The Limits of the Quantitative Approach to Discrimination.” Speech.