The Adam Algorithm for Optimization

2024-04-25

This is a relatively theoretical blog post on which I am intentionally giving you relatively little guidance. The purpose is to offer a challenge for students who have already done most of the previous blog posts. If you have already completed most of the previous posts, then I encourage you to give this one a try! Otherwise, I suggest that you instead focus on completing some of the prior posts.

The Adam optimization algorithm is a mainstay of modern deep learning. You can think of Adam as fancy gradient descent. It still uses gradient information, but processes that information in a more complex way that often produces state-of-the-art performance in modern large-scale tasks.

Adam was introduced by Kingma and Ba (2015), in a paper which has been cited over 175,000 times.

1 What You Should Do

Read the Adam Paper

Start by reading the paper that introduced Adam. You can focus on sections 1, 2, 3, and 6.1, skipping the others. It’s ok for your first reading to be relatively quick; you’ll want to go back through the sections multiple times as you do your implementation and experiments.

Implement Adam for Logistic Regression

Implement an AdamOptimizer class similar to the optimizers that you have used for several other models in this course. Your optimizer should work for any linear model which has a .grad() method which will return the gradient of the loss function with respect to the parameters. For concreteness, you can use the implementation of the logistic regression that you used in an earlier blog post; your AdamOptimizer will serve as a replacement for the GradientDescentOptimizer.

The authors of the paper frame their algorithm as minimizing a stochastic (random) objective function \(f\). This is a fancy mathematical way to talk about stochastic batch gradient descent. When they talk about evaluating/differentiating the “stochastic function” \(f\), you can instead think of “evaluating/differentiating \(f\) on a subset of the data.” So, you can think of \(g_t\) in their algorithm as the gradient evaluated on a batch of the data. You can still follow the standard structure of stochastic gradient descent in which you loop through the entire data set in one epoch, reshuffle the batches, and do it again.

Your implementation should allow the user to pass five arguments:

batch_size, the batch size for computing gradients. This works in the same way as it did in our initial implementation of stochastic gradient descent, and you can use much of the same code.
alpha, the step-size.
beta_1 and beta_2, the moment estimate decay rates.
w_0, the initial guess for the weight vector. Personally I would suggest giving this argument a default None value and, in the case that the user does not pass w_0, initialize it randomly with the correct dimensions.

Perform a Basic Experiment

Redo some of the simple experiments from your implementation of logistic regression. Compare Adam optimization to standard stochastic gradient descent with a few different parameter choices. Please measure both the number of epochs and the actual amount of time required to achieve convergence.

Perform a Digits Experiment

We saw an example of loading and manipulating the digits data set in the reading on k-means.

Load the digits dataset from scikit-learn. Then, do one of two things:

Filter the digits data set so that it only contains data with two class labels (e.g. 4s vs. 8s).
Optional challenge: extend your implementation of logistic regression to handle multiple class labels. Doing this by hand is quite challenging and I only recommend it if you are feeling very ambitious. You might be able to get some inspiration from this implementation, though please don’t copy code and acknowledge the author if you find her example helpful.

Then, do a version of the experiment in Section 6.1 of the Adam paper in which you again compare the efficiency of your Adam implementation against standard stochastic gradient descent in terms of both epochs and elapsed time.

Perform One More Experiment

Find a data set (any data set that interests you) on which you can perform classification with your Adam model. I recommend that the number of features not be very large (maybe no larger than 100). Then, again compare the performance of Adam to standard stochastic gradient descent on the data you found.

Write Your Post

In your written blog post, please:

Show your Adam implementation. Include comments corresponding to the comments in Alg. 1 of the paper so that the reader understands which lines of code correspond to which lines of math.
Describe and discuss the findings from your experiments.
Don’t forget to label your axes!!
Include an introductory section describing the purpose of the blog post and a summary section reflecting on your findings. In your summary paragraph, please include answers to the following questions:
- In the authors’ assessment, what aspects of the Adam algorithm help it to converge so efficiently? (You don’t need to go deep into the math but you should be able to identify a qualitative features from their paper).
- Did you observe highly efficient performance in your experiments when compared to other methods? If not, can you offer a hypothesis about why?
- How did the experience of implementing this algorithm compare to the experience of implementing standard stochastic gradient descent?

References

Kingma, Diederik P, and Jimmy Lei Ba. 2015. “Adam: A Method for Stochastic Gradient Descent.” In ICLR: International Conference on Learning Representations, 1–15. ICLR US.