Introduction to Bias and Fairness in Classification

Author

Phil Chodrow

Today we are going to study an extremely famous investigation into algorithmic decision-making in the sphere of criminal justice by Angwin et al. (2016), originally written for ProPublica. This investigation significantly accelerated the pace of research into bias and fairness in machine learning, due to the combination of its simple message and its publicly available data.

It’s helpful to look at a sample form used for feature collection in the COMPAS risk assessment.

You’ve already read about the COMPAS algorithm in the original article at ProPublica. Our goal today is to reproduce some of the main findings of this article and set the stage for a more systematic treatment of bias and fairness in machine learning.

Parts of these lecture notes are inspired by the original ProPublica analysis and Allen Downey’s expository case study on the same data.

Data Preparation

Let’s first obtain the data. I’ve hosted a copy on the course website, so we can download it using a URL.

import pandas as pd
import seaborn as sns
compas_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/compas-scores-two-years.csv"
compas = pd.read_csv(compas_url)

For today we are only going to consider a subset of columns.

cols = ["sex", "race", "decile_score", "two_year_recid"]
compas = compas[cols]

We are also only going to consider white (Caucasian) and Black (African-American) defendants:

# boolean vectors (technically, pd.Series)
is_white = compas["race"] == "Caucasian"
is_black = compas["race"] == "African-American"

# restrict to these two groups; copy to avoid SettingWithCopyWarning when we add columns later
compas = compas[is_white | is_black].copy()

Our data now looks like this:

compas.head()
      sex              race  decile_score  two_year_recid
1    Male  African-American             3               1
2    Male  African-American             4               1
3    Male  African-American             8               0
6    Male         Caucasian             6               1
8  Female         Caucasian             1               0

Preliminary Explorations

Let’s do some quick exploration of our data. How many defendants of each sex are present in this data?

compas.groupby("sex").size()
sex
Female    1219
Male      4931
dtype: int64

What about race?

compas.groupby("race").size()
race
African-American    3696
Caucasian           2454
dtype: int64

The decile score is the algorithm’s prediction. Higher decile scores indicate that, according to the COMPAS model, the defendant is more likely to be charged with a crime within the next two years. In the framework we’ve developed in this class, you can think of the decile score as related to quantities like \(\hat{y}_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\), which is a large number when the algorithm has high confidence in predicting a 1 label. Here, a decile score of 10 indicates high confidence in predicting a 1 (= recidivating) label.
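
To make the analogy concrete, here is a minimal sketch (not COMPAS’s actual procedure; the scores below are randomly generated purely for illustration) of how a continuous score like \(\hat{y}_i\) could be binned into decile scores:

import numpy as np

rng = np.random.default_rng(seed = 0)                      # hypothetical scores, for illustration only
y_hat = rng.normal(size = 1000)                            # stand-in for scores like <w, x_i>
deciles = pd.qcut(y_hat, 10, labels = list(range(1, 11)))  # 1 = lowest-risk decile, 10 = highest

A defendant whose score landed in the top 10% would then receive a decile score of 10.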

The easiest way to see how the decile scores are distributed across the two racial groups is with a bar chart, which we can make efficiently using the seaborn (sns) package.

counts = compas.groupby(["race", "decile_score"]).size().reset_index(name = "n")
sns.barplot(data = counts, x = "decile_score", y = "n", hue = "race")
[Figure: bar chart of the number of defendants at each decile score, separated by race.]

Finally, let’s take a look at the recidivism rate in the data:

compas["two_year_recid"].mean()
0.4661788617886179

So, in this data, approximately 47% of all defendants went on to be charged with another crime within the next two years. We can also compute the recidivism rate by race:

compas.groupby("race")["two_year_recid"].mean()
race
African-American    0.514340
Caucasian           0.393643
Name: two_year_recid, dtype: float64

The ProPublica Findings

We’re going to treat the COMPAS algorithm as a binary classifier, but you might notice a problem: the algorithm’s prediction is the decile_score column, which is not actually a 0-1 label. Following the analysis of Angwin et al. (2016), we are going to construct a new binary column in which we say that a defendant is predicted_high_risk if their decile_score is larger than 4 (that is, 5 or above).

compas["predicted_high_risk"] = (compas["decile_score"] > 4)

Now we have a binary prediction, and we can compute things like confusion matrices:

from sklearn.metrics import confusion_matrix
confusion_matrix(compas["two_year_recid"], 
                 compas["predicted_high_risk"])
array([[2129, 1154],
       [ 993, 1874]])

We can normalize this confusion matrix to get things like the false positive and false negative rates:

confusion_matrix(compas["two_year_recid"], 
                 compas["predicted_high_risk"],
                 normalize = "true")
array([[0.64849223, 0.35150777],
       [0.34635507, 0.65364493]])

We see that the algorithm (predicting recidivism if the decile score is 5 or above) is right about 65% of the time. A bit more specifically, both the true positive rate (TPR) and the true negative rate (TNR) are approximately 65%, while both the false positive rate (FPR) and the false negative rate (FNR) are approximately 35%.
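
If you’d rather have these rates as standalone numbers than read them off the matrix, one option (a quick sketch, reusing the confusion_matrix call from above) is to unravel the counts and compute the rates directly:

# counts: rows are true labels (0, 1), columns are predicted labels (0, 1)
tn, fp, fn, tp = confusion_matrix(compas["two_year_recid"],
                                  compas["predicted_high_risk"]).ravel()

fpr = fp / (fp + tn)    # false positive rate
fnr = fn / (fn + tp)    # false negative rate
tpr = tp / (tp + fn)    # true positive rate
tnr = tn / (tn + fp)    # true negative rate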

We can also check the overall accuracy:

(compas["two_year_recid"] == compas["predicted_high_risk"]).mean()
0.6508943089430894

The accuracy is relatively consistent even when we break things down by race:

black_ix = compas["race"] == "African-American"
white_ix = compas["race"] == "Caucasian"

correct_pred = compas["two_year_recid"] == compas["predicted_high_risk"]

# accuracy on Black defendants
accuracy_black = correct_pred[black_ix].mean()

# accuracy on white defendants
accuracy_white = correct_pred[white_ix].mean()
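
An equivalent one-liner (a quick sketch) groups the correctness indicator by race and takes the mean within each group:

# per-group accuracy: mean of the correctness indicator within each racial group
correct_pred.groupby(compas["race"]).mean()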

However, and this was the main finding of the ProPublica study, the FPR and FNR are very different when we break down the data by race. Here’s the confusion matrix for Black defendants:

confusion_matrix(compas["two_year_recid"][black_ix], 
                 compas["predicted_high_risk"][black_ix],
                 normalize = "true")
array([[0.55153203, 0.44846797],
       [0.27985271, 0.72014729]])

And here it is for white defendants:

confusion_matrix(compas["two_year_recid"][white_ix], 
                 compas["predicted_high_risk"][white_ix],
                 normalize = "true")
array([[0.76545699, 0.23454301],
       [0.47722567, 0.52277433]])

The ProPublica study focused on the false positive rate (FPR), which is in the top right corner of the confusion matrices. The FPR of approximately 45% for Black defendants means that, out of every 100 Black defendants who in fact will not commit another crime, the algorithm nevertheless predicts that about 45 of them will. In contrast, the FPR of approximately 23% for white defendants indicates that only about 23 out of 100 non-recidivating white defendants would be predicted to recidivate.
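
We can also compute the group-wise FPRs directly rather than reading them off the confusion matrices. Here is one possible sketch, using only the columns we have already constructed:

# restrict to defendants who did NOT recidivate, then ask how often each group was flagged high-risk
no_recid = compas[compas["two_year_recid"] == 0]
no_recid.groupby("race")["predicted_high_risk"].mean()    # FPR within each racial group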

There are a few ways in which we can think of this result as reflecting bias:

  • The algorithm has learned an implicit pattern wherein Black defendants are intrinsically more “criminal” than white defendants, even among people who factually never committed another crime. This is a bias in the patterns that the algorithm has learned in order to formulate its predictions. This is related to representational bias, which we’ll discuss more later in the semester.
  • Regardless of how the algorithm forms its predictions, the impact of the algorithm being used in the penal system is that more Black defendants will be classified as high-risk, resulting in more denials of parole, bail, early release, or other forms of freedom from the penal system. So, the algorithm has disparate impact on people. We might claim this as a form of allocative bias: bias in how resources or opportunities (in this case, freedom) are allocated between groups.

In the language of Corbett-Davies et al. (2017), an algorithm that has equal FPRs across two groups satisfies predictive equality with respect to those two groups. So, the COMPAS algorithm fails to possess predictive equality. Sometimes predictive equality is also defined to require that the false negative rates (FNRs) be equal across the two groups as well.

The idea of error rate balance in Chouldechova (2017) and balance for the positive/negative class in Kleinberg, Mullainathan, and Raghavan (2016) are similar to predictive equality.

In summary, the ProPublica argument was:

Since the FPR differs across racial groups in ways that reinforce the oppression of Black people, the COMPAS algorithm possesses racial bias.

Calibration

Is that the end of the story? Emphatically not! Angwin et al. (2016) kicked off a vigorous discussion about what it means for an algorithm to be fair and how to measure deviations from fairness. For example, Corbett-Davies et al. (2017) consider a different idea of fairness. While predictive equality requires that the FPRs for white and Black defendants be equal, calibration expresses a different intuition:

A white defendant and a Black defendant who each receive the same score should both have the same risk of recidivating.

Another way to say this is that a score of 7 means the same thing, no matter the race of the defendant.

Compare: an “A” in CS 201 means the same thing for your future success in CS, no matter your gender.

We can compute the recidivism rates for each race at each decile score using some Pandas .groupby magic:

means = compas.groupby(["race", "decile_score"])["two_year_recid"].mean().reset_index(name = "mean")

sns.lineplot(data = means, x = "decile_score", y = "mean", hue = "race")
[Figure: line plot of the observed two-year recidivism rate at each decile score, separated by race.]

The actual recidivism rate at each risk score is roughly the same between Black and white defendants, especially for decile scores past 5 or so.
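
If you prefer to inspect the numbers behind this plot, one option (a quick sketch) is to pivot the means table so that each row is a decile score and each column is a racial group:

# rows: decile scores; columns: racial groups; entries: observed two-year recidivism rates
means.pivot(index = "decile_score", columns = "race", values = "mean")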

Calibration for Binary Classifiers

So far in this course, we have primarily studied binary classifiers that produce a single 0-1 predicted label, rather than a score like a decile. For these classifiers, calibration means that the fraction of predicted recidivists who actually recidivated is the same across groups. If we again binarize the predictions, this time labeling a defendant as high risk when their decile score is 4 or above, we obtain the following results:

compas["pred_high_risk"] = compas["decile_score"] >= 4

means = compas.groupby(["race", "pred_high_risk"])["two_year_recid"].mean().reset_index(name = "mean")

p = sns.barplot(data = means, x = "pred_high_risk", y = "mean", hue = "race")

There are arguments to be had here, but from the perspective of calibration at this threshold, the algorithm might appear to be biased in the other direction: among defendants who were predicted high risk, slightly more Black than white defendants were charged with another crime within the next two years. Most of the published literature, however, treats these two rates as close enough that we should instead simply say that COMPAS appears to be reasonably well calibrated.

Overcoming Bias?

Ok, so COMPAS is reasonably calibrated, but it does not satisfy predictive equality. Couldn’t we just find a way to fix it so that it could be both calibrated and predictively equitable? A little fine-tuning here and there, maybe? Sadly, no: this is not just difficult, but actually mathematically impossible, as shown by Chouldechova (2017). Whenever the base rates of recidivism differ between two groups (as they do here: roughly 51% for Black defendants versus 39% for white defendants), no imperfect classifier can simultaneously be calibrated and have equal FPRs and FNRs across the groups.
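
A useful way to see why is an identity that holds for any binary classifier. Writing \(p\) for the base rate of recidivism within a group and \(\mathrm{PPV}\) for the positive predictive value (the fraction of predicted recidivists who actually recidivate), we have

\[
\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot \left(1 - \mathrm{FNR}\right)\;.
\]

If two groups have the same PPV (a calibration-like condition) and the same FNR, but different base rates \(p\), then the right-hand side differs between the groups, and so their FPRs must differ as well. This identity underlies the impossibility result in Chouldechova (2017).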

Kleinberg, Mullainathan, and Raghavan (2016) give several other definitions of fairness in algorithmic decision-making, again concluding that multiple intuitive concepts of fairness mathematically exclude one another except in special cases.



© Phil Chodrow, 2023

References

Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” In Ethics of Data and Analytics, 254–64. Auerbach Publications.
Chouldechova, Alexandra. 2017. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5 (2): 153–63.
Corbett-Davies, Sam, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. “Algorithmic Decision Making and the Cost of Fairness.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797–806.
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” arXiv Preprint arXiv:1609.05807.