Course Project

Project Description

The course project for CSCI 0451 is an opportunity for you to demonstrate your learning against one or more of the course’s six learning objectives on a topic of your choosing. Here’s the big picture:

Deliverables

There are five deliverables associated with your project:

A project proposal, due at the beginning of Week 7. The purpose of the proposal is for you and your group to carefully outline what you want to work on and explain why it’s feasible. Your proposal will be in the form of a README.md file in a shared GitHub repository that will house your software.
A mid-project update due in Week 9 or 10. This will be a short, informal presentation in which you will share what you’ve done with the class.
The project software (i.e., the GitHub repository itself), due at the end of finals week.
A project report in the form of an extended blog post in which you explain what you did, relate it to existing work, and show your experiments or other findings. The report is due at the end of finals week.
A project presentation during Week 12. The presentation will be 7-8 minutes and executed as a group. It should involve a visual aid, usually slides.

I’ll share more detailed information on each of these deliverables later in the course.

Group Work

I expect that most students will complete their projects in teams of 2-3 students. If you would like to work individually or in a group of 4, please seek my permission before submitting your project proposal and explain the reason for such a small (or large) group.

Learning Objectives

Remember that we have six learning objectives in this course. The project is itself one of these objectives: part of the course goal is for you to have the experience of initiating and pursuing an idea that you design. The other five objectives are:

Theory
Implementation
Navigation
Experimentation
Social Responsibility

In general, I expect most projects to address at least two of these learning objectives. For example, a project in which you implement and test a new algorithm would address Theory, Implementation, and Experimentation. A project in which you work with a data set that you care about on a learning task using existing tools could address Navigation and Experimentation. A project in which you replicate the findings of a recent study on algorithmic bias could address Experimentation and Social Responsibility. There are lots of valid possibilities. Your project proposal will state which of these learning objectives your project addresses, and your final report will describe what you learned under each objective.

What Makes a Good Project?

Big Picture

There’s a lot more detail on this topic below, but there are two simple questions that you should ask yourselves when envisioning your project:

Will I learn something by completing this project?
Will I be proud of this project once it’s done?

If the answer to both questions is “yes,” then your overall project idea is likely good. Feel free to approach me early if you want to talk over whether your project idea is suitable for the course.

Boring Projects

There is a kind of machine learning project that I feel is very boring and doesn’t really teach all that much. I’ll call these “Kaggle-style” projects. A Kaggle-style project is a project that starts with a convenient, clean data set and ends with a test score. Kaggle-style projects don’t clean or explore the data, don’t implement new algorithms, and don’t think carefully about why one algorithm might be better than another for the data in question. Instead, they simply try a bunch of things and assess them with a validation or test score. I will probably not be impressed with a Kaggle-style project, partly because these kinds of projects really only address the “Navigation” learning objective.

The website Kaggle is famous for hosting machine learning competitions in which the goal is to train a model that achieves the best prediction score on test data. I am being a little unfair; some Kaggle submissions are of exceptionally high quality.

It’s fine (indeed, encouraged!) for you to find some data that interests you and apply machine learning methods to it in order to make predictions or understand the structure of the data. To deepen your project beyond “Kaggle style,” you can incorporate some or all of the following:

Work with data that is messy and needs significant processing before it can be used for ML tasks.
Design a custom vectorization scheme (i.e., a way of representing each data point as a vector), or experiment with several pre-implemented schemes.
Construct multiple visualizations of your data that highlight patterns you wish to model or questions that you wish to explore.
Conduct a careful audit of your model to understand whether it performs better in some situations than others.
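
As one illustration of the audit idea in the last item above, here is a minimal sketch that computes accuracy separately for each subgroup of the data. The function name `accuracy_by_group` and the toy labels, predictions, and group memberships are hypothetical and exist only for illustration; a real audit would likely look at more metrics than accuracy.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy}, computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical labels, predictions, and group memberships (illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

audit = accuracy_by_group(y_true, y_pred, groups)
print(audit)
```

A large gap between groups in a table like this is a prompt for further investigation, not a conclusion in itself.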

Critical Discussion

Both your proposal and your project report should incorporate a critical discussion of incentives and impacts in your work. Questions to address include:

If someone were paying you to develop this model, who would be paying and why? Why might someone want this model to be built? Are you comfortable with that?
Who are the users of your work? Who could be affected by your work? Are these populations the same?
Are there risks of substantial bias or harm associated with the work you produce?

Ideas

Here are some suggestions for choosing project directions. It’s fine for your project to be something entirely different—indeed, I encourage it! The primary benefit of projects that I come up with is that they are more likely to “work.” The primary drawback is that they may not be what you’re interested in! Even projects that don’t fully meet their stated objectives can still be successful experiences that demonstrate learning for the course.

Theory and Implementation

If you enjoy thinking about math and how to translate math into performant numerical code, then you may wish to consider implementing a machine learning algorithm that we haven’t implemented in blog posts. There are many candidates, and it can be partially up to you to decide. Since the project is bigger than a blog post…

…Your implementation should likely be of a more complex algorithm than ones we implemented in blog posts already. Alternatively, you could implement several related algorithms.
…You will likely need to perform more complicated experiments in order to demonstrate the performance of your implementation.

Because these projects involve theoretical content that we haven’t covered in class, it is a good idea to talk with me before committing to one of them.

Support Vector Machine

Support vector machines (SVMs) were among the state-of-the-art binary classifiers before the rise of deep neural networks. Support vector machines are convex linear classifiers like logistic regression, but have some special mathematical properties that enable them to make much faster predictions on new data by leveraging sparsity. A good SVM project would likely involve most or all of the following:

Implementing SVM using a version of stochastic gradient descent called PEGASOS (Fig. 2 in Shalev-Shwartz, Singer, and Srebro (2007)).
Implementing kernel SVM, which enables nonlinear decision boundaries, using a quadratic solver.
Testing your results on several real and synthetic data sets for:
- Runtime of training.
- Runtime of prediction.
- Accuracy of prediction.
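
To give a sense of scale for the first item, here is a minimal sketch of a single-example Pegasos-style update for a linear SVM. It is a simplified variant (no bias term, no mini-batches, no projection step) and is not necessarily identical to the figure cited above. The function names, the values of `lam` and `n_steps`, and the toy data are assumptions chosen for illustration.

```python
import random


def dot(u, v):
    return sum(uj * vj for uj, vj in zip(u, v))


def pegasos_train(X, y, lam=0.1, n_steps=2000, seed=0):
    """Train a linear SVM (no bias term) with single-example Pegasos-style updates.

    X is a list of feature lists; y is a list of labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(X))      # sample one training example
        eta = 1.0 / (lam * t)          # decaying step size
        margin = y[i] * dot(w, X[i])   # margin under the current weights
        # Shrinkage step coming from the L2 regularization term.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # Hinge-loss subgradient step, taken only when the margin is violated.
        if margin < 1.0:
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1


# Hypothetical, linearly separable toy data (illustration only).
X = [[2, 1], [1, 2], [3, 3], [-2, -1], [-1, -2], [-3, -3]]
y = [1, 1, 1, -1, -1, -1]
w = pegasos_train(X, y)
print([predict(w, x) for x in X])
```

A full project would go well beyond this sketch: vectorized code, a bias term, the kernel version, and the runtime and accuracy experiments listed above.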

Faster Gradient Descent

Gradient descent and its relatives apply to a wide range of machine learning algorithms. Gradient descent comes in many flavors:

Stochastic
Momentum
Accelerated
Primal-dual methods
Modified gradient methods for nonconvex problems
Newton methods

and so on. One good project would be to implement several of these methods for one or more ML algorithms and compare the runtime of the training step on a variety of data sets.
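
As a small illustration of what such a comparison might look like, the sketch below runs plain gradient descent and a heavy-ball momentum variant on a toy ill-conditioned quadratic and counts the steps each needs to reach a gradient-norm tolerance. The objective, the step size `lr`, the momentum parameter `beta`, and the tolerance are all assumptions chosen for illustration; a real project would compare methods on actual ML training problems and measure wall-clock time as well.

```python
import math


def grad(w, curvatures):
    """Gradient of f(w) = 0.5 * sum(c_j * w_j**2)."""
    return [c * wj for c, wj in zip(curvatures, w)]


def descend(w0, curvatures, lr, beta=0.0, tol=1e-6, max_steps=10_000):
    """Heavy-ball gradient descent; beta=0 recovers plain gradient descent.

    Returns the final iterate and the number of steps taken.
    """
    w = list(w0)
    v = [0.0] * len(w)
    for step in range(max_steps):
        g = grad(w, curvatures)
        if math.sqrt(sum(gj * gj for gj in g)) < tol:
            return w, step
        # Velocity mixes the previous velocity with the new gradient step.
        v = [beta * vj - lr * gj for vj, gj in zip(v, g)]
        w = [wj + vj for wj, vj in zip(w, v)]
    return w, max_steps


# Toy quadratic with very different curvature along each axis (illustration only).
curvatures = [1.0, 50.0]
w_plain, steps_plain = descend([1.0, 1.0], curvatures, lr=0.02)
w_mom, steps_mom = descend([1.0, 1.0], curvatures, lr=0.02, beta=0.9)
print(steps_plain, steps_mom)
```

On this particular problem the momentum variant should need fewer steps, but that ranking depends on the problem and on how each method is tuned, which is exactly what a careful experiment would explore.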

Other

I can throw you lots of other theoretical/implementation problems if you’re interested—come chat and we can discuss.

Applied Analysis Projects

You may wish to take machine learning methods that we discussed in class and apply them to data about some topic that you care about. Some great project interests that I’ve heard include remote sensing, sports analytics, genome analysis, and text analysis/language modeling.

Audits and Algorithmic Bias

Many of you may have an interest in further exploring topics related to fairness and bias in machine learning algorithms. This is a great topic! Some possibilities include:

Thoroughly reproducing the findings of papers that diagnose algorithmic bias with an available data set or algorithm, such as Obermeyer et al. (2019).
Writing a critical review essay on the topic of algorithmic bias in a specific area. This should be a polished essay with lots of references and some specific, quantitative examples.

References

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53.

Shalev-Shwartz, Shai, Yoram Singer, and Nathan Srebro. 2007. “Pegasos: Primal Estimated Sub-Gradient Solver for SVM.” In Proceedings of the 24th International Conference on Machine Learning, 807–14.