Auditing Allocative Bias

In this blog post, you’ll create a machine learning model that predicts an individual characteristic like employment status or income on the basis of other demographic characteristics. You’ll then perform a fairness audit in order to assess whether or not your algorithm displays bias with respect to demographic characteristics like race or sex, and discuss your findings.

Published

March 29, 2023

This is one of two possible blog posts for this week. This blog post involves working with data and performing audits. If you’d rather write a research essay on the limitations of the quantitative approach to analyzing bias and discrimination, see the alternative assignment.

folktables was introduced by Ding et al. (2021).

The folktables package allows you to download and neatly organize data from the American Community Survey’s Public Use Microdata Sample (PUMS). You can install it in your ml-0451 environment by running the following two commands in your terminal:

conda activate ml-0451
pip install folktables

You can learn more about the folktables package, including documentation and examples, on the package’s GitHub page.

In this blog post, you’ll fit a classifier using data from folktables and perform a bias audit for the algorithm.

1 Using folktables

The first thing to do is to download some data! Here’s an illustration of downloading a complete set of PUMS data for the state of Alabama.

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "AL"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000049 6 1 1600 3 1 1013097 75 19 ... 140 74 73 7 76 75 80 74 7 72
1 P 2018GQ0000058 6 1 1900 3 1 1013097 75 18 ... 76 78 7 76 80 78 7 147 150 75
2 P 2018GQ0000219 6 1 2000 3 1 1013097 118 53 ... 117 121 123 205 208 218 120 19 123 18
3 P 2018GQ0000246 6 1 2400 3 1 1013097 43 28 ... 43 76 79 77 80 44 46 82 81 8
4 P 2018GQ0000251 6 1 2701 3 1 1013097 16 25 ... 4 2 29 17 15 28 17 30 15 1

5 rows × 286 columns

There are approximately 48,000 rows of PUMS data in this data frame. Each one corresponds to an individual resident of the given STATE who filled out the 2018 edition of the PUMS survey. You’ll notice that there are a lot of columns. In the modeling tasks here, we’re only going to focus on a relatively small number of features. Here are all the possible features I suggest you use:

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 19 18.0 5 17 2 NaN 1 3.0 4.0 1 1 2 2 2.0 2 1 6.0
1 18 18.0 5 17 2 NaN 1 3.0 4.0 1 1 2 2 2.0 2 2 6.0
2 53 17.0 5 16 1 NaN 1 1.0 4.0 2 1 2 2 1.0 1 1 6.0
3 28 19.0 5 16 2 NaN 1 1.0 2.0 1 1 2 2 2.0 1 1 6.0
4 25 12.0 5 16 1 NaN 1 3.0 4.0 1 1 2 2 1.0 2 1 6.0

For documentation on what these features mean, you can consult the appendix of the paper that introduced the package.

For a few examples:

  • ESR is an employment status code (a value of 1 indicates an employed civilian at work; the BasicProblem below converts this into a binary employment label via its target_transform)
  • RAC1P is race (1 for White Alone, 2 for Black/African American alone, 3 and above for other self-identified racial groups)
  • SEX is binary sex (1 for male, 2 for female)
  • DEAR, DEYE, and DREM relate to certain disability statuses.
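Since all of these features are numeric codes rather than readable categories, it can be worth glancing at how the codes are distributed before modeling. A quick, optional check (a sketch; any of the columns above can be substituted):

# Optional sanity check: how are the coded values distributed?
print(acs_data["ESR"].value_counts())    # employment status codes
print(acs_data["RAC1P"].value_counts())  # race codes
print(acs_data["SEX"].value_counts())    # sex codes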

Let’s consider the following task: we are going to

  1. Train a machine learning algorithm to predict whether someone is currently employed, based on their other attributes not including race, and
  2. Perform a bias audit of our algorithm to determine whether it displays racial bias.

First, let’s subset the features we want to use:

features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]

Now we can construct a BasicProblem that expresses our wish to use these features to predict employment status ESR, using race RAC1P as the group label. I recommend that you mostly don’t touch the target_transform, preprocess, and postprocess arguments.

You can find examples of constructing problems in the folktables source code if you really want to carefully customize your problem.
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

The result is a feature matrix features, a label vector label, and a group label vector group, in a convenient format for us to work with.

for obj in [features, label, group]:
  print(obj.shape)
(47777, 15)
(47777,)
(47777,)
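As an aside, folktables also ships predefined problems such as ACSEmployment (imported above), which bundle a feature list, target, and group for you. Check its feature list before using it for this audit, since it may not exclude RAC1P from the features the way the custom problem above does; for other experiments, though, it is a one-liner:

# Predefined problem from folktables (for comparison; verify its feature list
# before relying on it for a bias audit).
features_alt, label_alt, group_alt = ACSEmployment.df_to_numpy(acs_data)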

Before we touch the data any more, we should perform a train-test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Now we are ready to create a model and train it on the training data:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

We can then extract predictions on the test set like this:

y_hat = model.predict(X_test)

The overall accuracy in predicting whether someone is employed is:

(y_hat == y_test).mean()
0.7842193386354123

The accuracy for white individuals is

(y_hat == y_test)[group_test == 1].mean()
0.7838255977496483

The accuracy for Black individuals is

(y_hat == y_test)[group_test == 2].mean()
0.7838630806845965

We can also calculate confusion matrices, false positive rates, false negative rates, positive predictive values, prevalences, and lots of other information using tools we’ve already seen.
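For instance, here is one way (a sketch, using the confusion_matrix function imported above) to compute the FPR, FNR, and PPV separately for white (group 1) and Black (group 2) individuals:

# By-group error rates from the confusion matrix on the test set.
for g in [1, 2]:
    ix = group_test == g
    # with binary labels, ravel() returns tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_test[ix], y_hat[ix]).ravel()
    print(f"group {g}: FPR = {fp / (fp + tn):.3f}, "
          f"FNR = {fn / (fn + tp):.3f}, "
          f"PPV = {tp / (tp + fp):.3f}")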

2 What You Should Do

Choose Your Problem

Choose a prediction problem (target variable), a list of features, and a choice of group with respect to which to evaluate bias. I would suggest one of the following two possibilities:

  1. (What we just illustrated): predict employment status on the basis of demographics excluding race, and audit for racial bias.
  2. Predict whether income is over $50K on the basis of demographics excluding sex, and audit for gender bias.

You can also pick the state from which you would like to pull your data.

Do not audit for racial bias in VT, as we didn’t have enough Black individuals fill out the PUMS survey. 😬

Finally, you should choose a machine learning model. While you can use a model like logistic regression that you’ve previously implemented, my suggestion is to use one out of the box from scikit-learn. Some simple classifiers with good performance are:

  • sklearn.linear_model.LogisticRegression
  • sklearn.svm.SVC (support vector machine)
  • sklearn.tree.DecisionTreeClassifier (decision tree)
  • sklearn.ensemble.RandomForestClassifier (random forest)

Basic Descriptives

Use simple descriptive analysis to address the following questions. You’ll likely find it easiest to address these problems when working with a data frame. Here’s some code to turn your training data back into a data frame for easy analysis:

import pandas as pd
df = pd.DataFrame(X_train, columns = features_to_use)
df["group"] = group_train
df["label"] = y_train

Using this data frame, answer the following questions:

  1. How many individuals are in the data?
  2. Of these individuals, what proportion have target label equal to 1? In employment prediction, these would correspond to employed individuals.
  3. Of these individuals, how many are in each of the groups?
  4. In each group, what proportion of individuals have target label equal to 1?
  5. Check for intersectional trends by studying the proportion of positive target labels broken out by your chosen group label and an additional group label. For example, if you chose race (RAC1P) as your group, then you could also choose sex (SEX) and compute the proportion of positive labels by both race and sex. This might be a good opportunity to use a visualization such as a bar chart, e.g. via the seaborn package. A starter sketch follows this list.
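Here is a starter sketch of one way to compute these descriptives with pandas (and seaborn for the intersectional plot). It assumes the data frame df built above, with race as the group and SEX as the additional variable; adjust the column names for your own choices.

import seaborn as sns

print(len(df))                              # 1. number of individuals
print(df["label"].mean())                   # 2. proportion with positive label
print(df["group"].value_counts())           # 3. individuals in each group
print(df.groupby("group")["label"].mean())  # 4. positive-label rate per group

# 5. intersectional view: positive-label rate by group and sex
intersect = df.groupby(["group", "SEX"])["label"].mean().reset_index()
sns.barplot(data=intersect, x="group", y="label", hue="SEX")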

Train Your Model

Train your model on the training data. Please incorporate a tunable model complexity and use cross-validation to select a good value for it (one possible setup is sketched after this list). Some possibilities:

  • Use polynomial features with LogisticRegression.
  • Tune the regularization parameter C in SVC.
  • Tune the max_depth in DecisionTreeClassifier and RandomForestClassifier.
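For example, one possible setup (a sketch, not the only reasonable one) is a grid search over max_depth for a decision tree, using 5-fold cross-validation on the training data:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# search over tree depths, scoring each by 5-fold cross-validated accuracy
grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid={"max_depth": list(range(2, 16))},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)

model = grid.best_estimator_  # refit on the full training set by default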

Audit Your Model

Then, perform an audit in which you address the following questions (all on test data):

Overall Measures
  1. What is the overall accuracy of your model?
  2. What is the positive predictive value (PPV) of your model?
  3. What are the overall false negative and false positive rates (FNR and FPR) for your model?
By-Group Measures
  1. What is the accuracy of your model on each subgroup?
  2. What is the PPV of your model on each subgroup?
  3. What are the FNR and FPR on each subgroup?
Bias Measures

See Chouldechova (2017) for definitions of these terms. For calibration, you can think of the score as having only two values, 0 and 1.
  • Is your model approximately calibrated?
  • Does your model satisfy approximate error rate balance?
  • Does your model satisfy statistical parity?
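For example, statistical parity asks whether the rate of positive predictions is (approximately) equal across groups; a minimal check on the test set (a sketch, shown for groups 1 and 2) might look like this:

# Statistical parity check: predicted-positive rate per group on the test set.
for g in [1, 2]:
    rate = y_hat[group_test == g].mean()
    print(f"group {g}: predicted positive rate = {rate:.3f}")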

Concluding Discussion

In a few paragraphs, discuss the following questions:

  1. What groups of people could stand to benefit from a system that is able to predict the label you predicted, such as income or employment status? For example, what kinds of companies might want to buy your model for commercial use?
  2. Based on your bias audit, what could be the impact of deploying your model for large-scale prediction in commercial or governmental settings?
  3. Based on your bias audit, do you feel that your model displays problematic bias? What kind (calibration, error rate, etc)?
  4. Beyond bias, are there other potential problems associated with deploying your model that make you uncomfortable? How would you propose addressing some of these problems?

3 Optional Extras

Intersectional Bias?

As an optional component of your bias audit, you could consider checking for intersectional bias in your model. For example, is the FNR significantly higher for Black women than it is for Black men or white women?

To address this question, you’ll likely find it easier to work with a data frame again.

import pandas as pd
df = pd.DataFrame(X_test, columns = features_to_use)
df["group"] = group_test
df["label"] = y_test

Feasible FNR and FPR Rates

As an optional component of your bias audit, you could reproduce Figure 5 in Chouldechova (2017) (link). This figure uses Eq. (2.6), fixing the prevalence \(p\) (the proportion of individuals whose true label is positive) for each group, as well as a desired PPV that should be the same across both groups. With these numbers fixed, Eq. (2.6) then defines a line of feasible combinations of FNR and FPR, which you could plot. Don’t worry about reproducing the shaded regions unless you really want to.
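If I’ve recalled Eq. (2.6) correctly, it can be rearranged as \(\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1-\mathrm{PPV}}{\mathrm{PPV}} \cdot (1-\mathrm{FNR})\), so for fixed \(p\) and PPV it traces out a line in the (FNR, FPR) plane. A minimal plotting sketch under that assumption, with made-up prevalences and PPV that you should replace with values computed from your data:

import numpy as np
from matplotlib import pyplot as plt

# hypothetical numbers: replace with the prevalence of each group and the
# desired common PPV computed from your own data
p1, p2 = 0.4, 0.5
ppv = 0.7

fnr = np.linspace(0, 1, 101)
for p, lab in [(p1, "group 1"), (p2, "group 2")]:
    fpr = (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)
    plt.plot(fnr, fpr, label=lab)

plt.xlabel("FNR")
plt.ylabel("FPR")
plt.legend()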



© Phil Chodrow, 2023

References

Chouldechova, Alexandra. 2017. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5 (2): 153–63.
Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. “Retiring Adult: New Datasets for Fair Machine Learning.” Advances in Neural Information Processing Systems 34.