from matplotlib import pyplot as plt
import numpy as np

# code for the 0-1 loss plot referenced in the Convexity warmup below
fig, ax = plt.subplots(1, 1)
y_hat = np.linspace(-1, 1, 101)
loss = lambda y_hat, y: 1 - 1*(y_hat*y > 0)
ax.set(xlabel = r"$\hat{y}$", ylabel = r"$\ell(\hat{y}, y)$")
ax.plot(y_hat, loss(y_hat, 1))
Warmup Exercises
\[ \newcommand{\R}{\mathbb{R}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vz}{\mathbf{z}} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\bracket}[1]{\langle #1 \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\paren}[1]{\left( #1 \right)} \]
Perceptron
Part 1
Sketch the line in \(\R^2\) described by the equation \[ \bracket{\vw, \vx} = b\;, \tag{1}\]
where \(\vw = \paren{1, -\frac{1}{2}}^T \in \R^2\) and \(b = \frac{1}{2}\). Here, \(\bracket{\vw, \vx} = \sum_{i = 1}^n w_i x_i\) is the inner product (or dot product) between the vectors \(\vw\) and \(\vx\).
Part 2
Write a quick Python function called `perceptron_classify(w, b, x)`. `w` and `x` should both be 1d numpy arrays of the same length, and `b` should be a scalar. Your function should return `0` if \(\bracket{\vw, \vx} < b\) and `1` if \(\bracket{\vw, \vx} \geq b\). Your function should use neither a `for`-loop nor an `if`-statement. Verify that your function works on a few simple examples.
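For reference, here is a minimal sketch of one possible shape for such a function; the specific test vectors are arbitrary and simply reuse \(\vw\) and \(b\) from Part 1.

```python
import numpy as np

def perceptron_classify(w, b, x):
    # returns 1 if <w, x> >= b and 0 otherwise, with no loop and no if-statement
    return 1*(np.dot(w, x) >= b)

# quick checks using w = (1, -1/2) and b = 1/2 from Part 1
w, b = np.array([1.0, -0.5]), 0.5
print(perceptron_classify(w, b, np.array([1.0, 0.0])))   # <w, x> = 1.0  >= 0.5, so 1
print(perceptron_classify(w, b, np.array([0.0, 1.0])))   # <w, x> = -0.5 <  0.5, so 0
```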
Part 3
Consider a line of the general form of Equation 1. Let’s allow \(\vx\) and \(\vw\) to both be \(n\)-dimensional, so that this equation defines a hyperplane in \(\R^n\). Suppose that we wanted to represent the same hyperplane in \(\R^n\) using an equation of the form
\[ \bracket{\tilde{\vw}, \tilde{\vx}} = 0\;. \tag{2}\]
for some \(\tilde{\vw} \in \R^{n+1}\). Define \(\tilde{\vx} = (\vx, 1)\). How could you define \(\tilde{\vw}\) to make Equation 2 equivalent to Equation 1?
Convexity
As you learned in Daumé, informally, a convex function is a function that is “bowl-shaped.” Hardt and Recht give the formal definition, which has the benefit of applying to functions of many variables.
Part 1
Consider the 0-1 step function \(\ell(\hat{y}, y)\), which is plotted by the code block at the top of this page.
Show pictorially that this function is not convex. No proof needed – just the right drawing.
Part 2: Second Derivative Test
Another way to tell whether a function is convex is to check its second derivative. If a function \(f:S\rightarrow \R\) has a convex domain \(S\subseteq \R\), if \(f\) is everywhere twice-differentiable, and if \(\frac{d^2f(z_0)}{dz^2} > 0\) for all \(z_0 \in S\), then \(f\) is convex.
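For example (this is not one of the functions below): the function \(h(z) = z^2\) has convex domain \(\R\) and is twice-differentiable everywhere, with
\[
\frac{d^2 h(z_0)}{dz^2} = 2 > 0 \quad \text{for all } z_0 \in \R\;,
\]
so the second derivative test tells us that \(h\) is convex.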
Use the second derivative test to check that the following two functions are convex:
\[ \begin{aligned} f(z) &= - \log z \\ g(z) &= - \log(1-z)\;. \end{aligned} \]
Part 3: Plotting Practice
In a Jupyter notebook, write a simple program to plot each of the functions \(f\) and \(g\) from Part 2. Some of the Part 1 code is likely to help you.
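If you get stuck, here is one minimal sketch; the domain endpoints are arbitrary choices that stay away from the singularities of \(f\) at \(z = 0\) and of \(g\) at \(z = 1\).

```python
import numpy as np
from matplotlib import pyplot as plt

z = np.linspace(0.01, 0.99, 101)   # stay strictly inside (0, 1)

fig, ax = plt.subplots(1, 1)
ax.plot(z, -np.log(z), label = r"$f(z) = -\log z$")
ax.plot(z, -np.log(1-z), label = r"$g(z) = -\log(1-z)$")
ax.set(xlabel = r"$z$")
ax.legend()
```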
Part 4: Convexity in Many Variables
Recall the Hardt and Recht definition of convexity: a function \(f:\R^p \rightarrow \R\) is convex if, for any \(\lambda \in [0,1]\) and any points \(\vz_1, \vz_2 \in \R^p\),
\[ f(\lambda \vz_1 + (1-\lambda)\vz_2) \leq \lambda f(\vz_1) + (1-\lambda)f(\vz_2)\;. \]
Using this definition, write a short mathematical proof that the function \(f(\vz) = \norm{\vz} = \sqrt{\bracket{\vz, \vz}}\) is convex. You will want to use the triangle inequality, which says that \(\norm{\vz_1 + \vz_2} \leq \norm{\vz_1} + \norm{\vz_2}\).
Gradient Descent
Consider the quadratic function \(g(z) = \frac{1}{2}az^2 + bz + c\).
- Prove that \(g\) has a critical point at the point \(z^* = -\frac{b}{a}\) (hint: solve \(g'(z^*) = 0\)).
- What must be true about the constants \(a\), \(b\), and \(c\) to ensure that this point is a local minimum of \(g\)? (Hint: second derivative test).
- Suppose now that we are able to evaluate the function \(g\), as well as its derivative \(g'\), but are not able to use algebra to find \(z^*\) (this mirrors our situation in most practical problems). Instead, we are going to use the following algorithm to attempt to approximate \(z^*\) (a short code sketch of this procedure appears at the end of this section):
- Begin with some initial guess \(z^{(0)}\).
- In each time-step \(t\), compute \(z^{(t+1)} \gets z^{(t)} - \alpha g'(z^{(t)})\), where \(\alpha > 0\) is the learning rate.
- In practice we would need to specify a stopping criterion, but for this theoretical problem we don’t need to worry about it.
- Using algebra, prove that for any timestep \(t\), \[ (z^* - z^{(t+1)})^2 = (a\alpha - 1)^2(z^* - z^{(t)})^2\;. \]
- Let’s think of \(\abs{z^* - z^{(t)}}\) as the error in our current estimate \(z^{(t)}\). Using the recurrence above, conclude that, for any \(t\), the error \(\abs{z^* - z^{(t)}}\) satisfies \[ \abs{z^* - z^{(t)}} = \abs{a\alpha - 1}^{t}\abs{z^* - z^{(0)}}\;. \]
- For \(\alpha \in (\alpha_*, \alpha^*)\), we are guaranteed that the error \(\abs{z^* - z^{(t)}}\rightarrow 0\) as \(t\rightarrow \infty\). What are \(\alpha_*\) and \(\alpha^*\)?
- Suppose that \(\alpha\) is within the necessary range. I want to guarantee that \(\abs{z^* - z^{(t)}} < \epsilon\) for some small \(\epsilon > 0\) (in practice we often call this the tolerance). Conclude that the number of steps necessary to reach this tolerance is no greater than \[ \bar{t} = \frac{ \log \epsilon - \log \abs{z^* - z^{(0)}}}{\log \abs{a\alpha - 1}}\;. \]
Ignoring constants with respect to \(\epsilon\), we say that this algorithm for finding the minimum of \(g\) with tolerance \(\epsilon\) has a \(\log \epsilon\) convergence rate.
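Here is a minimal code sketch of the procedure described above, applied to a concrete quadratic. The values of \(a\), \(b\), \(c\), \(\alpha\), the starting guess, and the number of steps are all arbitrary choices for illustration, not part of the problem.

```python
a, b, c = 2.0, 1.0, 0.0            # an arbitrary quadratic with a > 0
g_prime = lambda z: a*z + b        # derivative of g(z) = (1/2) a z^2 + b z + c

alpha = 0.1                        # learning rate
z = 5.0                            # initial guess z^(0)
for t in range(100):
    z = z - alpha*g_prime(z)       # z^(t+1) = z^(t) - alpha * g'(z^(t))

print(z, -b/a)                     # z should be very close to the critical point z* = -b/a
```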
Gradient Descent (Again)
Consider the function \(f(w_0, w_1) = \sin(w_0w_1)\). You can define this function like this:
import numpy as np
def f(w):
    return np.sin(w[0]*w[1])
Mathematically, the gradient of this function is
\[\nabla f(w_0, w_1) = (w_1\cos w_0w_1, w_0 \cos w_0w_1)^T.\]
- Implement a simple loop that uses gradient descent to find a minimum of this function.
- You’ll have to choose the learning rate \(\alpha\).
- The `np.cos()` function will be useful for programming the gradient.
- It's not the fastest approach, but if you're not sure how to program the gradient you can always first implement it as a list of two floats, and then use `np.array(my_list)` to convert it into a numpy array.
- You'll also need to pick a random starting guess.
- Find two initial guesses for the parameter vector \(\vw\) such that you get two different final minimizers (this is possible because \(f\) is not convex).
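If you get stuck, here is a minimal sketch of one possible loop. The learning rate, the number of iterations, and the random starting guess are all arbitrary illustrative choices, not requirements.

```python
import numpy as np

def f(w):
    return np.sin(w[0]*w[1])

def grad_f(w):
    # gradient of f from the formula above
    return np.array([w[1]*np.cos(w[0]*w[1]), w[0]*np.cos(w[0]*w[1])])

alpha = 0.1                     # learning rate
w = np.random.rand(2)           # random starting guess
for t in range(1000):           # fixed number of steps instead of a stopping criterion
    w = w - alpha*grad_f(w)

print(w, f(w))                  # if the loop has found a minimum, f(w) will be close to -1
```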
Overfitting and the Scientific Method
In the scientific method, it is often emphasized that we need to formulate a hypothesis before performing an experiment. It’s fine for the hypothesis to be based on previous experiments. However, the scientific method never allows us to perform an experiment, formulate a hypothesis, and then say that the experiment supported the (new) hypothesis.
We can think of scientific theories as systems of thought that help us make predictions about new phenomena. With this in mind, please write a short paragraph explaining the importance of hypothesis-first science using the language of machine learning. In your explanation, please use the following vocabulary:
- Training data.
- Training accuracy.
- Validation/testing data.
- Validation/testing accuracy.
- Overfitting.
The Coin-Flipping Game
Let’s play a game! Here is the setup:
I have a coin with probability of heads equal to \(p \in [0,1]\). I am going to ask you to pick a number \(\hat{p} \in [0,1]\). Then, I flip my coin.
- If my coin comes up heads, you give me \(-\log \hat{p}\) dollars.
- If my coin comes up tails, you give me \(-\log (1-\hat{p})\) dollars.
Part 1
Compute the expected amount of money you will give me when we play this game in terms of \(p\) and \(\hat{p}\). Call this quantity \(R(\hat{p}, p)\). This is the risk of the guess \(\hat{p}\).
Part 2
Suppose I tell you the value of \(p\). Write a mathematical proof to show that your best choice of \(\hat{p}\) (the one that loses you the least money) is \(\hat{p} = p\).
Part 3
Now suppose that I don’t tell you the true value of \(p\). Instead, I let you observe \(n\) coin flips before asking you to make your guess. Describe:
- A suggestion for choosing \(\hat{p}\) based only on the results of the previous flips.
- A way to estimate the risk (expected amount of money lost) based only on the results of the previous flips.
Your answer should depend on \(\hat{p}\) but not on \(p\)!
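If you want to check your answers numerically, here is a hedged simulation sketch. The values of \(p\) and \(n\), and the particular choice of \(\hat{p}\), are illustrative assumptions rather than the answers you are asked to justify.

```python
import numpy as np

rng = np.random.default_rng(451)
p, n = 0.37, 10_000                # true heads probability and number of observed flips (arbitrary)
flips = rng.random(n) < p          # boolean array: True means heads

p_hat = flips.mean()               # one natural guess based only on the observed flips

# estimate of the expected loss per flip, using only the observed flips (no p appears here)
empirical_risk = -(flips*np.log(p_hat) + (~flips)*np.log(1 - p_hat)).mean()
```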
Balancing Classification Rates
Use the code from our recent lecture to download the Titanic data set as a Pandas data frame and train a model on the training data. Then download the test data. Compute `y_pred`, the vector of predictions of your model on the test data.
Then, write a function that verifies eq. (2.6) in Alexandra Chouldechova’s paper “Fair Prediction with disparate impact.” Here’s what your function should do:
- Given vectors `y_pred` of predictions and `y_test` of actual labels, compute the False Negative Rate (FNR), False Positive Rate (FPR), prevalence \(p\), and positive predictive value (PPV).
- Return as a tuple the lefthand side and righthand side of eq. (2.6) in Chouldechova.
- Verify that the two numbers are equal!
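A hedged sketch of how the four rates might be computed with numpy is below; the function name is a placeholder, and the exact lefthand and righthand sides of eq. (2.6) should be taken from the paper itself.

```python
import numpy as np

def rates(y_pred, y_test):
    # FNR, FPR, prevalence, and PPV computed directly from the two label vectors
    y_pred, y_test = np.array(y_pred), np.array(y_test)
    fnr = ((y_pred == 0) & (y_test == 1)).sum() / (y_test == 1).sum()   # P(predict 0 | truly 1)
    fpr = ((y_pred == 1) & (y_test == 0)).sum() / (y_test == 0).sum()   # P(predict 1 | truly 0)
    p   = (y_test == 1).mean()                                          # prevalence
    ppv = ((y_pred == 1) & (y_test == 1)).sum() / (y_pred == 1).sum()   # P(truly 1 | predict 1)
    return fnr, fpr, p, ppv
```

Your function would then combine these four quantities into the two sides of eq. (2.6) and return them as a tuple.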
Limits of The Quantitative Approach to Discrimination
I’ll give you each a number in Slack. The numbers correspond to the following sections of Narayanan (2022). These are:
- The null hypothesis allocates the burden of proof (p. 7-8)
- Compounding inequality is far below the radar of quantitative methods (p. 9-10)
- Snapshot datasets hide discrimination (p. 10-11)
- Explaining away discrimination (p. 12-13)
- What counts as evidence is a subjective choice (p. 5-7)
For your assigned section, please write a short paragraph (4-5 simple sentences is fine). You should:
- Summarize Narayanan’s key points in that section.
- In one of the sentences, describe which aspects of the Uber case study (p. 13-16) reflect the ideas of the section you described.
Bring your paragraph in class and be ready to read it to your group.
Vectorization Brainstorm
In a recent lecture, we discussed methods of vectorizing text like the document-term matrix that use the bag of words assumption: the order of the words doesn’t matter!
Take some time and propose an alternative approach to word-based text vectorization. Can you find a scheme that would give different vector representations to the following two sentences?
“I love rabbits, not cats.” “I love cats, not rabbits.”
You don’t have to implement your vectorization, but please be prepared to write pseudocode for your group to show in detail how you would perform the vectorization.
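To see the issue concretely, here is a small sketch using scikit-learn's CountVectorizer (one standard bag-of-words implementation, used here only for illustration) showing that the two sentences above receive identical vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I love rabbits, not cats.", "I love cats, not rabbits."]
X = CountVectorizer().fit_transform(sentences).toarray()
print(X)   # both rows are identical under the bag-of-words representation
```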
Image Compression Factor of K-Means
In today’s reading on K-means clustering from the Python Data Science Handbook, Jake VanderPlas considers the use of K-means to reduce the number of distinct colors in an image (Example 2). I encourage you to run the code for this example while thinking about this warmup!
Give an estimate of the compression factor: the reduction of information achieved when compressing an image using k-means clustering into \(k\) color clusters. The compression factor is the number of bits required to store the compressed image, divided by the number of bits required to store the original image. Both of these numbers can be computed asymptotically (i.e. with big-oh reasoning) in order to simplify the analysis.
There are multiple good ways to think about this question, and you’re welcome to choose one that makes sense to you as long as you carefully state your steps and assumptions. Here are a few points that I find helpful:
Bits in Original Image
- An image with \(n\) rows and \(m\) columns has \(nm\) pixels.
- Each pixel has three color channels: Red, Green, and Blue (RGB).
- Each color channel can be represented with 8 bits (which encode an integer between 0 and 255, denoting the color intensity in that channel).
Bits in Compressed Image
- If I compress an image into just \(k\) distinct colors, then instead of storing the full RGB value for each pixel, I can just store enough bits to uniquely identify the cluster containing each pixel. How many bits do I need for this?
- I also need to store a dictionary (hash map) that associates color \(j\) (i.e. the centroid of the \(j\)th cluster of colors) to its RGB value.
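For concreteness, here is one possible way to assemble these hints into an estimate (a sketch under the assumptions above, not the only reasonable accounting):
\[
\text{original} \approx 24nm \text{ bits}, \qquad
\text{compressed} \approx nm\lceil \log_2 k \rceil + 24k \text{ bits}, \qquad
\text{factor} \approx \frac{nm\lceil \log_2 k \rceil + 24k}{24nm} \approx \frac{\log_2 k}{24}\;,
\]
where the final approximation treats the \(24k\)-bit dictionary as negligible for a large image.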
Optional Extra
Try running the code above while varying the number of clusters. Do you think that a 16-color compression looks much better than an 8-color compression? Do you think the difference is good enough to justify the additional storage? What about 32 colors vs. 16?
Introducing Tensors
First, install PyTorch 2.0 into your `ml-0451` Anaconda environment. The best way to install PyTorch is probably to run the following at the command line:
conda activate ml-0451
pip3 install torch torchvision torchaudio
Then, in a blank Jupyter notebook, copy, paste, and run each of the code blocks in the first section of the PyTorch tutorial.
Finally, write down a single-sentence answer to each of the following questions:
- In what ways is a PyTorch tensor similar to a Numpy array?
- In what ways is a PyTorch tensor different from a Numpy array?
- What is the primary motivation for the use of a specialized tensor data type, rather than an array, for deep learning?
Efficient Differentiation
Consider the following function: \[ L(w, b) = \frac{1}{2} \left(y - \sigma(wx + b)\right)^2 + \frac{1}{2}\lambda w^2\;, \]
where \(\sigma(a) = \frac{1}{1 + e^{-a}}\).
This is the loss function that would be obtained when using a single feature \(x\) to predict \(y\), using the function \(\sigma(wx + b)\) as the predictor and measuring the quality of this predictor using the square-error loss function, with a “ridge” regularization term \(\frac{1}{2}\lambda w^2\) that encourages the weight \(w\) to be small. Our aim is to compute the gradient of \(L\) with respect to \(w\) and \(b\).
The gradient of \(L\) is \(\nabla L (w, b) = \left(\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right)\), where
\[ \begin{aligned} \frac{\partial L}{\partial w} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))x + \lambda w \\ \frac{\partial L}{\partial b} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))\;. \end{aligned} \tag{3}\]
What You Should Do
Assume that each of the following operations cost one computational unit:
- Multiplying or dividing two scalar numbers.
- Adding or subtracting two scalar numbers.
- Computing an exponential like \(e^a\).
Using this assumption:
- Determine the number of computational units (i.e. computational cost) of computing the gradient of \(L\) exactly as written in Equation 3, under the assumption that you are not allowed to store the values of any intermediate computations.
- Now determine the computational cost of computing the gradient of \(L\) under the assumption that you are allowed to store intermediate computations. Please describe both the number of computations and the number of floating point numbers that must be stored.
- Finally, determine the computational cost in terms of both steps and storage to compute \(L\) using the backpropagation algorithm (described for a very similar function in Hegde’s notes).
Compare your results from each method.
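As a concrete illustration of what “storing intermediate computations” can look like in code (a sketch for intuition, not a substitute for the counting exercise above):

```python
import numpy as np

def grad_L(w, b, x, y, lam):
    # lam plays the role of the regularization strength lambda
    # compute shared intermediate quantities once and reuse them in both partials
    s = 1 / (1 + np.exp(-(w*x + b)))    # sigma(wx + b), evaluated a single time
    common = (s - y) * s * (1 - s)      # factor shared by dL/dw and dL/db
    dL_dw = common * x + lam * w
    dL_db = common
    return dL_dw, dL_db
```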
Convolutional Kernels
Implement kernel convolution. Your implementation should accept a 2d array `X` (think of `X` as representing a greyscale image) and a square convolutional kernel `K`. Your implementation should operate using pure `numpy`. You can use any zero-padding strategy, but you do need to explain what your strategy is when presenting.
It’s ok to use a for-loop to loop over pixels.
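If it helps to see the overall shape, here is one possible sketch. It zero-pads the image so that the output has the same shape as the input; note that, unlike scipy's convolve2d, it does not flip the kernel, which makes no difference for symmetric kernels like the edge-detection example below.

```python
import numpy as np

def convolve(X, K):
    # zero-padding strategy: pad X so that the output has the same shape as X
    m = K.shape[0] // 2
    Xp = np.pad(X, m)                                   # zeros around the border
    out = np.zeros(X.shape)
    for i in range(X.shape[0]):                         # loop over pixels (allowed here)
        for j in range(X.shape[1]):
            patch = Xp[i:i + K.shape[0], j:j + K.shape[1]]
            out[i, j] = np.sum(patch * K)
    return out
```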
Here’s an example image you can use:
from PIL import Image
import urllib.request
import numpy as np
from matplotlib import pyplot as plt

def read_image(url):
    return np.array(Image.open(urllib.request.urlopen(url)))

url = "https://i.pinimg.com/originals/0e/d0/23/0ed023847cad0d652d6371c3e53d1482.png"
img = read_image(url)

def to_greyscale(im):
    return 1 - np.dot(im[...,:3], [0.2989, 0.5870, 0.1140])

img = to_greyscale(img)

plt.imshow(img, cmap = "Greys")
plt.gca().axis("off")
After implementing your function, you should be able to use it like this, replacing `convolve2d` from `scipy.signal` with your own implementation. The result should look something like this:
from scipy.signal import convolve2d
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

convd = convolve2d(img, kernel)

plt.imshow(convd, cmap = "Greys", vmin = 0, vmax = 8)
plt.gca().axis("off")
Project Check-In
This is a warmup activity that we will repeat weekly until the end of the semester.
Prepare a 2-3 minute “presentation” of your project to your group. Your presentation can be informal and does not need to have any special visual aids. The primary expectation is that you are able to demonstrate some relevant functionality to your peers. If your project involves coding or data analysis, your relevant functionality might be as simple as accessing or preparing the data. You should plan to demonstrate additional functionality each week.
In other words, you should show your group something that works, regardless of how “big” it is.
If you are doing a project that does not involve implementation (such as a research essay), then you are still expected to offer an update. Your contributions could include describing the sources you’ve found or showing your group an outline of the argument that you will make.
It’s appropriate for each member of your project group to give the same presentation during warmup. Please note that you may not be in the same warmup group as your project partners. This means that:
- The code you show needs to run on your laptop or in your compute instance (e.g. Google Colab).
- You need to be ready to explain what is being shown, even if your project partners did much of the work.
What Needs To Be Learned?
Suppose that you wanted to teach an individual to recognize English-language phishing emails. Write down a few features (based on the linked website, your own experience, or other sources) that you think would help someone classify an email as “phishing attempt or not” based on the text of the email.
Now, imagine that you are going to sit down with your tutee to teach them how to recognize English-language phishing emails. Where would you start your instruction if…
- Your tutee was another member of your warmup group.
- Your tutee was a fluent English speaker but had never used email.
- Your tutee was a regular email user but spoke no English.
- Your tutee spoke no English and had never seen a computer.
Which of these four scenarios would require the most “learning effort?” Which would require the least?
Word Embedding
Take out a sheet of paper and a pencil. Your goal is to place the following words on the sheet of paper in such a way that their location on the sheet is indicative of their relationships to each other. You can decide exactly how to do this. Should words with similar meanings be in the same part of the page? Should pairs of words with similar relationships have similar distances? Your approach is up to you, but please write it down along with your placements. Your words are:
- Woman
- Student
- Nurse
- Doctor
- Man
- Professor
- Model
- Computer
- Machine
- Programmer
Realistic Text?
In our reading on The Unreasonable Effectiveness of Recurrent Neural Networks, there are a few examples of model output that is realistic but not real. Pick one of the examples, and write down as carefully as you are able what makes the generated text realistic. Then, describe what “tells” would tip off an attentive observer that the text isn’t real (generated intentionally by a human) after all.
The Shakespeare and Wikipedia examples might be the easiest ones to think about, but feel free to look at the \(\LaTeX\) or Linux source code examples if you prefer.
Mind Map
Wow, we’ve covered a lot of ground in this class! Use a graphics program or a pen/paper to make a mind map describing some of our main concepts. As a reminder, a mind-map is a graph in which the nodes are concepts and edges join related concepts. Please incorporate the following concepts as nodes:
- Loss function
- Target
- Predictor
- Model
- Regression
- Classification
- Empirical risk minimization
- Gradient descent
- Feature map
- Vectorization
- Overfitting
- Training data
- Validation/testing data
- Perceptron
- Logistic regression
- Neural networks
- Linear regression
Additionally, please incorporate at least three other concepts of your own choosing.
Flowcharts in Quarto
Since mind-maps can be a little complicated to organize, you might find it easiest to work with some software. One optional possibility is actually included with Quarto: the Mermaid chart tool can render attractive diagrams that include labeled nodes and directed edges. Using a flowchart is probably the way to go. For example, inserting the following code into a special `{mermaid}` code block will produce the following diagram:
flowchart TB
A(First concept) --> B(Second concept)
B --is part<br> of--> A
B-->C(Third concept)
A-->C
A & B & C -->D(Fourth concept)
For the basics of using Mermaid with Quarto, see the Quarto docs. A benefit of this approach is that you don’t have to worry too much about positioning, and you can publish your mind map easily on your blog! On the other hand, this approach doesn’t give you much flexibility and makes it harder to be creative about incorporating complex relationships. Doing your mind map by hand or in any other software is entirely fine.
Classification Rates
Part 1
COVID-19 rapid tests have approximately an 80% sensitivity rate, which means that, in an individual who truly has COVID-19, the probability of a rapid test giving a positive result is roughly 80%. On the other hand, the probability of a rapid test giving a positive result for an individual who truly does not have COVID-19 is 5%. Suppose that approximately 4% of the population are currently infected with COVID-19.
Write a Python function called `rate_summary` that prints the following output, filling in the correct values for each of the specified rates:
s = 0.8            # test sensitivity
f = 0.02           # probability of positive test if no COVID
prevalence = 0.05  # fraction of population infected

rate_summary(s, f, prevalence)
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___.
The false negative rate is ___.
Part 2
- Suppose that scientists found an alternative rapid test which had a 75% sensitivity rate with a 0% chance of a positive test on someone who is truly not infected. Would you suggest replacing the old rapid tests with these alternative tests? Why?
- What if the alternative test had an 85% sensitivity rate and a 10% chance of a positive test on someone who is truly not infected?
Part 3
It’s all well and good to do the math, but what about when we actually have data? Write a function called `rate_summary_2` that accepts two columns of a `pandas.DataFrame` (or equivalently two one-dimensional `numpy.array`s of equal length). Call these `y` and `y_pred`. Assume that both `y` and `y_pred` are binary arrays (i.e. arrays of 0s and 1s). `y` represents the true outcome, whereas `y_pred` represents the prediction from an algorithm or test. Here’s an example of the kind of data we are thinking about:
import pandas as pd

url = "https://github.com/middlebury-csci-0451/CSCI-0451/raw/main/data/toy-classification-data.csv"
df = pd.read_csv(url)

df.head() # just for visualizing the first few rows
|   | y | y_pred |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |
You should be able to use your function like this:
# y is the true label, y_pred is the prediction
rate_summary_2(df["y"], df["y_pred"])
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___.
The false negative rate is ___.
Hints
An excellent solution for this part will not use any for-loops. Computing each of the four rates can be performed in a single compact line of code. To begin thinking of how you might do this, you may want to experiment with code like the following:
df[["y"]] == df[["y_pred"]]
df[["y"]].sum(), df[["y_pred"]].sum()
© Phil Chodrow, 2023