In this blog post, you’ll create a machine learning model that predicts an individual characteristic, such as employment status or income, on the basis of other demographic characteristics. You’ll then perform a fairness audit to assess whether your algorithm displays bias with respect to demographic characteristics like race or sex, and discuss your findings.
Published
March 29, 2023
This is one of two possible blog posts for this week. This blog post involves working with data and performing audits. If you’d rather write a research essay on the limitations of the quantitative approach to analyzing bias and discrimination, see the alternative assignment.
The folktables package allows you to download and neatly organize data from the American Community Survey’s Public Use Microdata Sample (PUMS). You can install it in your ml-0451 environment by running the following two commands in your terminal:
conda activate ml-0451
pip install folktables
You can learn more about the folktables package, including documentation and examples, on the package’s GitHub page.
In this blog post, you’ll fit a classifier using data from folktables and perform a bias audit for the algorithm.
1 Using folktables
The first thing to do is to download some data! Here’s an illustration of downloading a complete set of PUMS data for the state of Alabama.
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "AL"

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=[STATE], download=True)
acs_data.head()
| | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | ST | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|----|----------|----------|---------|------|--------|----|--------|-------|------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | P | 2018GQ0000049 | 6 | 1 | 1600 | 3 | 1 | 1013097 | 75 | 19 | ... | 140 | 74 | 73 | 7 | 76 | 75 | 80 | 74 | 7 | 72 |
| 1 | P | 2018GQ0000058 | 6 | 1 | 1900 | 3 | 1 | 1013097 | 75 | 18 | ... | 76 | 78 | 7 | 76 | 80 | 78 | 7 | 147 | 150 | 75 |
| 2 | P | 2018GQ0000219 | 6 | 1 | 2000 | 3 | 1 | 1013097 | 118 | 53 | ... | 117 | 121 | 123 | 205 | 208 | 218 | 120 | 19 | 123 | 18 |
| 3 | P | 2018GQ0000246 | 6 | 1 | 2400 | 3 | 1 | 1013097 | 43 | 28 | ... | 43 | 76 | 79 | 77 | 80 | 44 | 46 | 82 | 81 | 8 |
| 4 | P | 2018GQ0000251 | 6 | 1 | 2701 | 3 | 1 | 1013097 | 16 | 25 | ... | 4 | 2 | 29 | 17 | 15 | 28 | 17 | 30 | 15 | 1 |

5 rows × 286 columns
There are approximately 48,000 rows of PUMS data in this data frame. Each one corresponds to an individual resident of the given STATE who filled out the 2018 edition of the PUMS survey. You’ll notice that there are a lot of columns; in the modeling tasks here, we’re only going to focus on a relatively small number of features. Here are all the possible features I suggest you use:
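For example, one reasonable list of candidate features is sketched below. The variable names are standard PUMS columns, but the exact selection is an illustrative assumption on my part, not a prescribed list:

```python
# An illustrative set of candidate PUMS features (an assumption, not the only
# valid choice). ESR (employment status) and RAC1P (race) are included so that
# they can serve as the target and group label below.
possible_features = ['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT',
                     'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE',
                     'DREM', 'SEX', 'RAC1P', 'ESR']

# Restrict the data frame to just these columns.
acs_data = acs_data[possible_features]
```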
For documentation on what these features mean, you can consult the appendix of the paper that introduced the package.
For a few examples:
ESR is employment status (1 if employed, 0 if not)
RAC1P is race (1 for White Alone, 2 for Black/African American alone, 3 and above for other self-identified racial groups)
SEX is binary sex (1 for male, 2 for female)
DEAR, DEYE, and DREM relate to certain disability statuses.
Let’s consider the following task: we are going to
Train a machine learning algorithm to predict whether someone is currently employed, based on their other attributes not including race, and
Perform a bias audit of our algorithm to determine whether it displays racial bias.
First, let’s subset the features we want to use:
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]
Now we can construct a BasicProblem that expresses our wish to use these features to predict employment status ESR, using race (RAC1P) as the group label. I recommend that you mostly don’t touch the target_transform, preprocess, and postprocess arguments.
You can find examples of constructing problems in the folktables source code if you really want to carefully customize your problem.
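Here is a minimal sketch of constructing the problem, converting it to numpy arrays, splitting into training and test sets, and fitting a simple model. The target_transform, preprocess, and postprocess choices below mirror the built-in ACSEmployment problem in folktables; the 80/20 split and the plain LogisticRegression model are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Predict ESR from the chosen features, with RAC1P as the group label.
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,           # employed or not
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),  # replace NaNs, as in ACSEmployment
)

# Convert to numpy arrays: feature matrix, target vector, group vector.
features, label, group = EmploymentProblem.df_to_numpy(acs_data)

# Hold out 20% of the data as a test set.
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

# Fit an illustrative model; you'll choose and tune your own below.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```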
We can then extract predictions on the test set like this:
y_hat = model.predict(X_test)
The overall accuracy in predicting whether someone is employed is:
(y_hat == y_test).mean()
0.7842193386354123
The accuracy for white individuals is
(y_hat == y_test)[group_test == 1].mean()
0.7838255977496483
The accuracy for Black individuals is
(y_hat == y_test)[group_test == 2].mean()
0.7838630806845965
We can also calculate confusion matrices, false positive rates, false negative rates, positive predictive values, prevalences, and lots of other information using tools we’ve already seen.
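For instance, here is one sketch of computing an overall confusion matrix and the rates derived from it with scikit-learn, using the variable names from the illustration above:

```python
from sklearn.metrics import confusion_matrix

# Rows of the confusion matrix are true labels, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()

fpr = fp / (fp + tn)  # false positive rate
fnr = fn / (fn + tp)  # false negative rate
ppv = tp / (tp + fp)  # positive predictive value
prevalence = (tp + fn) / (tp + tn + fp + fn)
```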
2 What You Should Do
Choose Your Problem
Choose a prediction problem (target variable), a list of features, and a choice of group with respect to which to evaluate bias. I would suggest one of the following two possibilities:
(What we just illustrated): predict employment status on the basis of demographics excluding race, and audit for racial bias.
Predict whether income is over $50K on the basis of demographics excluding sex, and audit for gender bias.
You can also pick the state from which you would like to pull your data.
Do not audit for racial bias in VT, as not enough Black individuals filled out the PUMS survey there. 😬
Finally, you should choose a machine learning model. While you can use a model like logistic regression that you’ve previously implemented, my suggestion is to use one out of the box from scikit-learn. Some simple classifiers with good performance are:
LogisticRegression
SVC (support vector machine)
DecisionTreeClassifier
RandomForestClassifier
Use simple descriptive analysis to address the following questions. You’ll likely find it easiest to address these questions when working with a data frame. Here’s some code to turn your training data back into a data frame for easy analysis:
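A minimal sketch of that step, assuming the variable names X_train, y_train, group_train, and features_to_use from the illustration above:

```python
import pandas as pd

# Reassemble the training data into a data frame, with the group label
# and target attached as extra columns.
df = pd.DataFrame(X_train, columns=features_to_use)
df["group"] = group_train
df["label"] = y_train
```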
Using this data frame, answer the following questions:
How many individuals are in the data?
Of these individuals, what proportion have target label equal to 1? In employment prediction, these would correspond to employed individuals.
Of these individuals, how many are in each of the groups?
In each group, what proportion of individuals have target label equal to 1?
Check for intersectional trends by studying the proportion of positive target labels broken out by your chosen group labels and an additional group label. For example, if you chose race (RAC1P) as your group, then you could also choose sex (SEX) and compute the proportion of positive labels by both race and sex. This might be a good opportunity to use a visualization such as a bar chart, e.g. via the seaborn package (a sketch follows this list).
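Here is a sketch of how you might compute some of these quantities, assuming the data frame df constructed above and that SEX is among your chosen features:

```python
import seaborn as sns

# Cast the boolean target to 0/1 so that means are easy to read and plot.
df["label"] = df["label"].astype(int)

n = len(df)                                    # number of individuals
p_overall = df["label"].mean()                 # overall proportion of positive labels
p_by_group = df.groupby("group")["label"].mean()                 # by group
p_intersectional = df.groupby(["group", "SEX"])["label"].mean()  # by group and sex

# A bar chart of the intersectional proportions.
sns.barplot(data=df, x="group", y="label", hue="SEX")
```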
Train Your Model
Train your model on the training data. Please incorporate a tunable model complexity, and use cross-validation to select a good value for it (see the sketch after this list). Some possibilities:
Use polynomial features with LogisticRegression.
Tune the regularization parameter C in SVC.
Tune the max_depth in DecisionTreeClassifier and in RandomForestClassifier.
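As one sketch of this tuning step, here is a simple cross-validation loop over max_depth for a decision tree; the depth grid and the 5-fold setting are arbitrary choices:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Pick the depth with the best mean cross-validation accuracy.
best_depth, best_score = None, -np.inf
for depth in range(2, 16):
    tree = DecisionTreeClassifier(max_depth=depth)
    score = cross_val_score(tree, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score

# Refit on the full training set at the chosen depth.
model = DecisionTreeClassifier(max_depth=best_depth)
model.fit(X_train, y_train)
```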
Audit Your Model
Then, perform an audit in which you address the following questions (all on test data):
Overall Measures
What is the overall accuracy of your model?
What is the positive predictive value (PPV) of your model?
What are the overall false negative and false positive rates (FNR and FPR) for your model?
By-Group Measures
What is the accuracy of your model on each subgroup?
What is the PPV of your model on each subgroup?
What are the FNR and FPR on each subgroup?
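One way to compute these by-group quantities is to reuse the confusion-matrix idea from above within each group; the group codes shown are for the race-as-group example:

```python
from sklearn.metrics import confusion_matrix

for g in [1, 2]:  # e.g. 1 = white alone, 2 = Black alone
    ix = group_test == g
    tn, fp, fn, tp = confusion_matrix(y_test[ix], y_hat[ix]).ravel()
    print(f"group {g}: "
          f"accuracy = {(tp + tn) / (tp + tn + fp + fn):.3f}, "
          f"PPV = {tp / (tp + fp):.3f}, "
          f"FNR = {fn / (fn + tp):.3f}, "
          f"FPR = {fp / (fp + tn):.3f}")
```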
Bias Measures
See Chouldechova (2017) for definitions of these terms. For calibration, you can think of the score as having only two values, 0 and 1.
Is your model approximately calibrated?
Does your model satisfy approximate error rate balance?
Does your model satisfy statistical parity?
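For instance, statistical parity can be checked by comparing the rate of positive predictions across groups; with a binary score, approximate calibration amounts to comparing the by-group PPVs computed above:

```python
# Rate of positive predictions within each group (statistical parity check).
for g in [1, 2]:
    rate = y_hat[group_test == g].mean()
    print(f"group {g}: P(prediction = 1) = {rate:.3f}")
```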
Concluding Discussion
In a few paragraphs, discuss the following questions:
What groups of people could stand to benefit from a system that is able to predict the label you predicted, such as income or employment status? For example, what kinds of companies might want to buy your model for commercial use?
Based on your bias audit, what could be the impact of deploying your model for large-scale prediction in commercial or governmental settings?
Based on your bias audit, do you feel that your model displays problematic bias? What kind (calibration, error rate, etc.)?
Beyond bias, are there other potential problems associated with deploying your model that make you uncomfortable? How would you propose addressing some of these problems?
3 Optional Extras
Intersectional Bias?
As an optional component of your bias audit, you could consider checking for intersectional bias in your model. For example, is the FNR significantly higher for Black women than it is for Black men or white women?
To address this question, you’ll likely find it is easier to work with a data frame again.
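A sketch of one way to do this, rebuilding a test-set data frame and computing the FNR within each (group, sex) combination (column names as before):

```python
import pandas as pd

# Reassemble the test data with group, true label, and prediction attached.
test_df = pd.DataFrame(X_test, columns=features_to_use)
test_df["group"] = group_test
test_df["label"] = y_test
test_df["prediction"] = y_hat

# FNR within each (group, sex) combination: among individuals whose true
# label is positive, the proportion predicted negative.
fnr_by_group_sex = (test_df[test_df["label"] == 1]
                    .groupby(["group", "SEX"])["prediction"]
                    .apply(lambda s: 1 - s.mean()))
```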
As an optional component of your bias audit, you could reproduce Figure 5 in Chouldechova (2017) (link). This figure uses Eq. (2.6), fixing the prevalence (the proportion of positive true labels) \(p\) for each group, as well as a desired PPV that should be the same across both groups. With these numbers fixed, Eq. (2.6) then defines a line of feasible FNR and FPR rates, which you could plot. Don’t worry about reproducing the shaded regions unless you really want to.
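If I’m recalling the relation correctly, Eq. (2.6) links the rates as \(\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})\). Here is a sketch of plotting the resulting lines for two illustrative prevalences and a shared PPV; the specific numbers are placeholders, so substitute the prevalences you measured for your groups:

```python
import numpy as np
from matplotlib import pyplot as plt

ppv = 0.7                                        # shared target PPV (placeholder)
prevalences = {"group 1": 0.4, "group 2": 0.6}   # placeholder prevalences

fnr = np.linspace(0, 1, 101)
for name, p in prevalences.items():
    # Feasible FPR as a function of FNR for this group's prevalence.
    fpr = (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)
    plt.plot(fnr, fpr, label=f"{name} (p = {p})")

plt.xlabel("False negative rate")
plt.ylabel("False positive rate")
plt.legend()
plt.show()
```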
Chouldechova, Alexandra. 2017. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5 (2): 153–63.
Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. “Retiring Adult: New Datasets for Fair Machine Learning.” Advances in Neural Information Processing Systems 34.