import pandas as pd
import seaborn as sns
compas_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/compas-scores-two-years.csv"
compas = pd.read_csv(compas_url)
Introduction to Bias and Fairness in Classification
Today we are going to study an extremely famous investigation into algorithmic decision-making in the sphere of criminal justice by Angwin et al. (2016), originally written for ProPublica. This investigation significantly accelerated the pace of research into bias and fairness in machine learning, thanks to the combination of its simple message and publicly available data.
You’ve already read about the COMPAS algorithm in the original article at ProPublica. Our goal today is to reproduce some of the main findings of this article and set the stage for a more systematic treatment of bias and fairness in machine learning.
Parts of these lecture notes are inspired by the original ProPublica analysis and Allen Downey’s expository case study on the same data.
Data Preparation
Let’s first obtain the data. I’ve hosted a copy on the course website, so we can download it using a URL.
For today we are only going to consider a subset of columns.
cols = ["sex", "race", "decile_score", "two_year_recid"]
compas = compas[cols]
We are also only going to consider white (Caucasian) and Black (African-American) defendants:
# boolean vectors (technically, pd.Series)
is_white = compas["race"] == "Caucasian"
is_black = compas["race"] == "African-American"

compas = compas[is_white | is_black]
compas = compas.copy()
Our data now looks like this:
compas.head()
| | sex | race | decile_score | two_year_recid |
|---|---|---|---|---|
| 1 | Male | African-American | 3 | 1 |
| 2 | Male | African-American | 4 | 1 |
| 3 | Male | African-American | 8 | 0 |
| 6 | Male | Caucasian | 6 | 1 |
| 8 | Female | Caucasian | 1 | 0 |
Preliminary Explorations
Let’s do some quick exploration of our data. How many defendants of each sex are present in this data?
compas.groupby("sex").size()
sex
Female 1219
Male 4931
dtype: int64
What about race?
compas.groupby("race").size()
race
African-American 3696
Caucasian 2454
dtype: int64
The decile score is the algorithm’s prediction. Higher decile scores indicate that, according to the COMPAS model, the defendant is more likely to be charged with a crime within the next two years. In the framework we’ve developed in this class, you can think of the decile score as related to quantities like \(\hat{y}_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\), which is a large number when the algorithm has high confidence in predicting a 1 label. Here, a decile score of 10 indicates high confidence in predicting a 1 (= recidivating) label.
The easiest way to see how this looks is with a bar chart, which we can make efficiently using the seaborn (sns) package.
counts = compas.groupby(["race", "decile_score"]).size().reset_index(name = "n")
sns.barplot(data = counts, x = "decile_score", y = "n", hue = "race")
<AxesSubplot: xlabel='decile_score', ylabel='n'>
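Because the two groups differ in size, raw counts can make the score distributions hard to compare. A minimal sketch of one alternative, converting counts to within-group shares, is below. The toy data frame and the group labels "A" and "B" are made up for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for the COMPAS columns (hypothetical values, not real data).
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B"],
    "decile_score": [1, 1, 2, 3, 1, 3],
})

# Same counting step as above.
counts = toy.groupby(["race", "decile_score"]).size().reset_index(name = "n")

# Divide each count by the total for its group, so shares sum to 1 per group.
counts["share"] = counts["n"] / counts.groupby("race")["n"].transform("sum")
print(counts)
```

Passing `y = "share"` instead of `y = "n"` to `sns.barplot` would then plot the shares rather than the counts.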
Finally, let’s take a look at the recidivism rate in the data:
compas["two_year_recid"].mean()
0.4661788617886179
So, in this data, approximately 47% of all defendants went on to be charged with another crime within the next two years. We can also compute the recidivism rate by race:
compas.groupby("race")["two_year_recid"].mean()
race
African-American 0.514340
Caucasian 0.393643
Name: two_year_recid, dtype: float64
The ProPublica Findings
We’re going to treat the COMPAS algorithm as a binary classifier, but you might notice a problem: the algorithm’s prediction is the decile_score column, which is not actually a 0-1 label. Following the analysis of Angwin et al. (2016), we are going to construct a new binary column in which we say that a defendant is predicted_high_risk if their decile_score is larger than 4.
compas["predicted_high_risk"] = (compas["decile_score"] > 4)
Now we have a binary prediction, and we can compute things like confusion matrices:
from sklearn.metrics import confusion_matrix
confusion_matrix(compas["two_year_recid"],
                 compas["predicted_high_risk"])
array([[2129, 1154],
[ 993, 1874]])
We can normalize this confusion matrix to get things like the false positive and false negative rates:
confusion_matrix(compas["two_year_recid"],
                 compas["predicted_high_risk"],
                 normalize = "true")
array([[0.64849223, 0.35150777],
[0.34635507, 0.65364493]])
We see that the algorithm (predicting recidivism if decile_score is 5 or above) is right about 65% of the time. A bit more specifically, both the true positive (TP) and true negative (TN) rates are approximately 65%. Both the false positive (FP) and false negative (FN) rates are approximately 35%.
We can also check the overall accuracy:
(compas["two_year_recid"] == compas["predicted_high_risk"]).mean()
0.6508943089430894
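To make the connection between the raw and normalized confusion matrices explicit, here is a small sketch that recomputes the error rates and the accuracy by hand from the counts printed above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Counts from the unnormalized confusion matrix above.
# Rows: true label (0, 1). Columns: predicted label (0, 1).
cm = np.array([[2129, 1154],
               [ 993, 1874]])

tn, fp = cm[0]
fn, tp = cm[1]

fpr = fp / (fp + tn)             # false positives among true negatives
fnr = fn / (fn + tp)             # false negatives among true positives
accuracy = (tp + tn) / cm.sum()  # correct predictions among all defendants

print(round(fpr, 4), round(fnr, 4), round(accuracy, 4))
```

Normalizing with `normalize = "true"` is the same as dividing each row by its row sum, which is why the FPR and FNR appear as off-diagonal entries of the normalized matrix.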
The accuracy is relatively consistent even when we break things down by race:
black_ix = compas["race"] == "African-American"
white_ix = compas["race"] == "Caucasian"

correct_pred = compas["two_year_recid"] == compas["predicted_high_risk"]

# accuracy on Black defendants
accuracy_black = correct_pred[black_ix].mean()

# accuracy on white defendants
accuracy_white = correct_pred[white_ix].mean()
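The same per-group accuracy can also be computed in one step with `.groupby`, which avoids building a separate boolean index for each group. This is a sketch on a small toy data frame with made-up values; only the column names match the ones used above.

```python
import pandas as pd

# Toy data with the same column names as above (hypothetical values).
toy = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "two_year_recid":      [1, 0, 1, 0, 1, 0, 0, 0],
    "predicted_high_risk": [1, 1, 1, 0, 0, 0, 0, 1],
})

# True where the prediction matches the outcome.
correct = toy["two_year_recid"] == toy["predicted_high_risk"]

# Mean of a boolean Series within each group is that group's accuracy.
accuracy_by_race = correct.groupby(toy["race"]).mean()
print(accuracy_by_race)
```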
However, and this was the main finding of the ProPublica study, the FPR and FNR are very different when we break down the data by race. Here’s the confusion matrix for Black defendants:
confusion_matrix(compas["two_year_recid"][black_ix],
                 compas["predicted_high_risk"][black_ix],
                 normalize = "true")
array([[0.55153203, 0.44846797],
[0.27985271, 0.72014729]])
And here it is for white defendants:
confusion_matrix(compas["two_year_recid"][white_ix],
                 compas["predicted_high_risk"][white_ix],
                 normalize = "true")
array([[0.76545699, 0.23454301],
[0.47722567, 0.52277433]])
The ProPublica study focused on the false positive rate (FPR), which is in the top right corner of the confusion matrices. The FPR of roughly 45% for Black defendants means that, out of every 100 Black defendants who in fact will not commit another crime, the algorithm nevertheless predicts that about 45 of them will. In contrast, the FPR of roughly 23% for white defendants indicates that only about 23 out of 100 non-recidivating white defendants would be predicted to recidivate.
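As a sketch of how these group-wise error rates could be computed without calling `confusion_matrix` once per group, the hypothetical helper `error_rates_by_group` below filters on the true label and then averages the 0-1 predictions within each group. The helper name and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def error_rates_by_group(df, group, y_true, y_pred):
    """FPR and FNR of the 0-1 column `y_pred` within each value of `group`."""
    negatives = df[df[y_true] == 0]
    positives = df[df[y_true] == 1]
    fpr = negatives.groupby(group)[y_pred].mean()      # share of true 0s predicted 1
    fnr = 1 - positives.groupby(group)[y_pred].mean()  # share of true 1s predicted 0
    return pd.DataFrame({"FPR": fpr, "FNR": fnr})

# Toy data with the same column names as above (hypothetical values).
toy = pd.DataFrame({
    "race": ["African-American"] * 4 + ["Caucasian"] * 4,
    "two_year_recid":      [1, 0, 1, 0, 1, 0, 0, 0],
    "predicted_high_risk": [1, 1, 1, 0, 0, 0, 0, 1],
})

rates = error_rates_by_group(toy, "race", "two_year_recid", "predicted_high_risk")
print(rates)
```

Calling the helper on `compas` with the same three column names should give the top-right and bottom-left entries of the two normalized confusion matrices above.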
There are a few ways in which we can think of this result as reflecting bias:
- The algorithm has learned an implicit pattern wherein Black defendants are intrinsically more “criminal” than white defendants, even among people who factually never committed another crime. This is a bias in the patterns that the algorithm has learned in order to formulate its predictions. This is related to representational bias, which we’ll discuss more later in the semester.
- Regardless of how the algorithm forms its predictions, the impact of the algorithm being used in the penal system is that more Black defendants will be classified as high-risk, resulting in more denials of parole, bail, early release, or other forms of freedom from the penal system. So, the algorithm has disparate impact on people. We might claim this as a form of allocative bias: bias in how resources or opportunities (in this case, freedom) are allocated between groups.
In the language of Corbett-Davies et al. (2017), an algorithm that has equal FPRs across two groups satisfies predictive equality with respect to those two groups. So, the COMPAS algorithm fails to possess predictive equality. The ideas of error rate balance in Chouldechova (2017) and balance for the positive/negative class in Kleinberg, Mullainathan, and Raghavan (2016) are similar to predictive equality.
In summary, the ProPublica argument was:
Since the FPR differs across racial groups in ways that reinforce the oppression of Black people, the COMPAS algorithm possesses racial bias.
Calibration
Is that the end of the story? Emphatically not! Angwin et al. (2016) kicked off a vigorous discussion about what it means for an algorithm to be fair and how to measure deviations from fairness. For example, Corbett-Davies et al. (2017) consider a different idea of fairness. While predictive equality requires that the FPRs for white and Black defendants be equal, calibration expresses a different intuition:
A white defendant and a Black defendant who each receive the same score should both have the same risk of recidivating.
Another way to say this is that a score of 7 means the same thing, no matter the race of the defendant.
We can compute the recidivism rates for each race at each decile score using some Pandas .groupby magic:
means = compas.groupby(["race", "decile_score"])["two_year_recid"].mean().reset_index(name = "mean")

sns.lineplot(data = means, x = "decile_score", y = "mean", hue = "race")
<AxesSubplot: xlabel='decile_score', ylabel='mean'>
The actual recidivism rate at each risk score is roughly the same between Black and white defendants, especially for decile scores past 5 or so.
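One way to put a number on "roughly the same" is to compute the gap between the two groups' recidivism rates at each score. The sketch below does this on toy data with hypothetical group labels "A" and "B"; the values are made up, and the choice of the absolute difference as the gap measure is an assumption rather than a standard from the literature.

```python
import pandas as pd

# Toy data with the same column names as above (hypothetical values).
toy = pd.DataFrame({
    "race":           ["A", "A", "A", "A", "B", "B", "B", "B"],
    "decile_score":   [1, 1, 2, 2, 1, 1, 2, 2],
    "two_year_recid": [0, 1, 1, 1, 0, 0, 1, 1],
})

# One row per score, one column per group, entries are recidivism rates.
rates = toy.pivot_table(index = "decile_score", columns = "race",
                        values = "two_year_recid", aggfunc = "mean")

# Absolute difference between the groups at each score.
gap = (rates["A"] - rates["B"]).abs()
print(gap)
```

A perfectly calibrated score would have a gap of zero at every decile; in real data, small groups at some scores make the gaps noisy.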
Calibration for Binary Classifiers
So far in this course, we have primarily studied binary classifiers that produce a single 0-1 predicted label, rather than a score like a decile. For these classifiers, calibration means that the fraction of predicted recidivists who actually recidivated is the same across groups. If we follow an approach like that of Angwin et al. (2016) and say that the algorithm predicts someone as high risk if their decile score is 4 or above (a slightly lower cutoff than the “larger than 4” rule used earlier), we would obtain the following results:
compas["pred_high_risk"] = compas["decile_score"] >= 4

means = compas.groupby(["race", "pred_high_risk"])["two_year_recid"].mean().reset_index(name = "mean")

p = sns.barplot(data = means, x = "pred_high_risk", y = "mean", hue = "race")
There are arguments to be had here, but from the perspective of calibration at the decile score threshold of 4, the algorithm might appear to be biased in the other direction: of those who were predicted high risk, a slightly larger share of Black than of white defendants were arrested within the next two years. In most of the published literature, scholars have considered the two rates sufficiently close that we should instead simply say that COMPAS appears to be reasonably well calibrated.
Overcoming Bias?
Ok, so COMPAS is reasonably calibrated, but does not satisfy predictive equality. Couldn’t we just find a way to fix it so that it could be both calibrated and predictively equitable? A little fine-tuning here and there maybe? Sadly, no: this is not just difficult, but actually mathematically impossible, as shown by Chouldechova (2017).
Kleinberg, Mullainathan, and Raghavan (2016) give some other definitions of fairness in algorithmic decision-making, again concluding that several concepts of fairness mathematically exclude other ones.
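A short numerical sketch may help make the result of Chouldechova (2017) concrete. For any binary classifier and any group with recidivism rate \(p\), positive predictive value PPV (the fraction of predicted recidivists who actually recidivated), and false negative rate FNR, the identity \(\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot (1 - \mathrm{FNR})\) holds. So if two groups have different \(p\) but the same PPV, they cannot also share both the same FPR and the same FNR. The code below checks the identity using per-group counts at the decile_score > 4 threshold; these counts are not printed anywhere above and are reconstructed here from the group sizes, recidivism rates, and normalized confusion matrices, so treat them as an assumption.

```python
# Per-group counts at the "decile_score > 4" threshold, reconstructed from
# the group sizes, recidivism rates, and normalized confusion matrices above.
counts = {
    "African-American": {"tn": 990,  "fp": 805, "fn": 532, "tp": 1369},
    "Caucasian":        {"tn": 1139, "fp": 349, "fn": 461, "tp": 505},
}

results = {}
for group, c in counts.items():
    n = sum(c.values())
    p = (c["tp"] + c["fn"]) / n           # base rate of recidivism
    ppv = c["tp"] / (c["tp"] + c["fp"])   # positive predictive value
    fnr = c["fn"] / (c["fn"] + c["tp"])   # false negative rate
    fpr = c["fp"] / (c["fp"] + c["tn"])   # false positive rate
    # FPR implied by the identity; should match the directly computed FPR.
    implied_fpr = p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)
    results[group] = {"p": p, "ppv": ppv, "fnr": fnr,
                      "fpr": fpr, "implied_fpr": implied_fpr}
    print(group, round(p, 3), round(ppv, 3), round(fpr, 3), round(implied_fpr, 3))
```

In this reconstruction the PPVs of the two groups are fairly close while the base rates differ substantially, which is exactly the situation in which the identity forces the error rates apart.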
© Phil Chodrow, 2023