‘Optimal’ Decision-Making

2024-02-29

The expectations for all blog posts apply!

In two sets of lecture notes, we studied the theory of making (binary) decisions based on a linear score function. We illustrated the theory in the context of credit-risk prediction, in which we attempted to identify prospective borrowers who are most likely to default on loans. For the purposes of illustrating the theory, our analysis in those lecture notes was very simple. In particular, we used only two features, and did not make a major attempt to find an optimal vector of weights \(\mathbf{w}\).

You have two aims in this blog post:

Find a score function and a threshold which optimize total expected profit for the bank, using any of the features contained in the data and any method, under slightly more realistic assumptions. We’ll think of your score function and threshold as defining an automated decision system for a hypothetical bank extending credit.
Assess how your automated decision system impacts different segments of the population of prospective borrowers.

Part A: Grab the Data

Start by downloading the training data:

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
df_train = pd.read_csv(url)

The columns in this data are:

person_age, the age of the prospective borrower.
person_income, the income of the prospective borrower at time of application.
person_home_ownership, the home ownership status of the prospective borrower at time of application. Possible values are MORTGAGE, OWN, RENT, and OTHER.
person_emp_length, the length of most recent employment for the prospective borrower, in years.
loan_intent, the purpose of the loan request.
loan_grade, a composite measure of the likelihood of the borrower to repay the loan. You may not use this column in your score function in Part C.
loan_amnt, the amount of the loan.
loan_int_rate, the annual interest rate on the loan.
loan_status, whether or not the borrower defaulted on the loan. This is the target variable.
loan_percent_income, the amount of the loan as a percentage of the prospective borrower’s personal income.
cb_person_default_on_file, whether the prospective borrower has previously defaulted on a loan in the records of a credit bureau.
cb_person_cred_hist_length, the length of credit history of the prospective borrower.

Part B: Explore The Data

Create at least two visualizations and one summary table in which you explore patterns in the data. You might consider some questions like:

How does loan intent vary with the age, length of employment, or homeownership status of an individual?
Which segments of prospective borrowers are offered low interest rates? Which segments are offered high interest rates?
Which segments of prospective borrowers have access to large lines of credit?
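One way to start on the summary table is a grouped aggregation. The sketch below is only an illustration: it runs on a tiny made-up frame that borrows the column names of the training data, and the helper name summarize_by is hypothetical. In your post you would pass df_train instead.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the training data.
# These numbers are invented for illustration; use df_train in the post.
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "RENT", "OWN", "MORTGAGE", "MORTGAGE"],
    "loan_int_rate": [12.0, 14.0, 8.0, 9.0, 11.0],
    "loan_amnt": [5000, 7000, 10000, 20000, 16000],
})

def summarize_by(df, group_col):
    """Mean interest rate, mean loan amount, and row count per group."""
    return (
        df.groupby(group_col)
          .agg(
              mean_rate=("loan_int_rate", "mean"),
              mean_amount=("loan_amnt", "mean"),
              n=("loan_amnt", "size"),
          )
          .sort_values("mean_rate")  # lowest average rate first
    )

summary = summarize_by(toy, "person_home_ownership")
print(summary)
```

The same helper works for any categorical column, such as loan_intent, so it can feed several of the questions above.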

Part C: Build a Model

Please use any technique to construct a score function and threshold for predicting whether a prospective borrower is likely to default on a given loan. You may use all the features in the data except loan_grade (and the target variable loan_status), and you may choose any subset of these. There are several valid ways to approach this modeling task:

Choose features and estimate entries of a weight vector \(\mathbf{w}\) by hand (this is allowed but not recommended).
(Recommended) Choose your features, engineer new ones if needed, and fit a score-based machine learning model to the data. My suggestion is LogisticRegression from scikit-learn. Once you have fit a logistic regression model, the weight vector \(\mathbf{w}\) is stored as the attribute model.coef_.

I suggest that you try several combinations of features, possibly including some which you create, and test out which combinations work best with cross-validation.
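A minimal sketch of comparing feature combinations with cross-validation might look like the following. It assumes scikit-learn is installed, and it uses synthetic data in place of df_train so that it runs on its own. The helper cv_accuracy and the standardization step are my own choices for illustration, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_accuracy(df, feature_cols, target_col="loan_status", folds=5):
    """Mean cross-validated accuracy of logistic regression on the given columns."""
    # Drop rows with missing values in the columns we actually use.
    data = df[feature_cols + [target_col]].dropna()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, data[feature_cols], data[target_col], cv=folds)
    return scores.mean()

# Synthetic stand-in for df_train (made-up numbers, not the real data).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "person_age": rng.integers(20, 70, size=n),
    "loan_percent_income": rng.uniform(0.0, 0.6, size=n),
})
noise = rng.normal(0.0, 0.05, size=n)
toy["loan_status"] = (toy["loan_percent_income"] + noise > 0.3).astype(int)

# Candidate feature sets to compare.
candidates = {
    "age only": ["person_age"],
    "age + loan_percent_income": ["person_age", "loan_percent_income"],
}
results = {name: cv_accuracy(toy, cols) for name, cols in candidates.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On the real data, qualitative columns such as loan_intent would first need to be converted to numeric columns, for example with pd.get_dummies.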

Part D: Find a Threshold

Once you have a weight vector \(\mathbf{w}\), it is time to choose a threshold \(t\). To choose a threshold that maximizes profit for the bank, we need to make some assumptions about how the bank makes and loses money on loans. Let’s use the following (simplified) modeling assumptions:

If the loan is repaid in full, the profit for the bank is equal to loan_amnt*(1 + 0.25*loan_int_rate)**10 - loan_amnt. This formula assumes that the profit earned by the bank on a 10-year loan is equal to 25% of the interest rate each year, with the other 75% of the interest going to things like salaries for the people who manage the bank. It is extremely simplistic and does not account for inflation, amortization over time, opportunity costs, etc.
If the borrower defaults on the loan, the “profit” for the bank is equal to loan_amnt*(1 + 0.25*loan_int_rate)**3 - 1.7*loan_amnt. This formula corresponds to the same profit-earning mechanism as above, but assumes that the borrower defaults three years into the loan and that the bank loses 70% of the principal.
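The two formulas above translate directly into code. In this sketch the function names are hypothetical, and rate is assumed to be a fraction such as 0.10. If the CSV stores loan_int_rate in percent units, you would divide by 100 first; that is worth checking with df_train["loan_int_rate"].describe() before relying on it.

```python
def profit_if_repaid(loan_amnt, rate):
    """Bank profit on a 10-year loan that is repaid in full.

    Assumption: `rate` is a fraction (0.10 means 10% per year).
    """
    return loan_amnt * (1 + 0.25 * rate) ** 10 - loan_amnt

def profit_if_default(loan_amnt, rate):
    """Bank "profit" when the borrower defaults three years in
    and the bank loses 70% of the principal."""
    return loan_amnt * (1 + 0.25 * rate) ** 3 - 1.7 * loan_amnt

# Worked example: a 10,000 loan at 10% annual interest.
gain = profit_if_repaid(10_000, 0.10)
loss = profit_if_default(10_000, 0.10)
print(f"repaid: {gain:.2f}, default: {loss:.2f}")
```

In this example a single default costs the bank a bit more than twice what a fully repaid loan earns, which is why the choice of threshold matters.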

These modeling assumptions are extremely simplistic! You may deviate from them if you have relevant prior knowledge to inform your approach.

Based on your assumptions, determine the threshold \(t\) which optimizes profit for the bank on the training set. Explain your approach, including labeled visualizations where appropriate, and include a final estimate of the bank’s expected profit per borrower on the training set.
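One simple way to find the threshold is a grid search: compute the profit per borrower at many candidate values of \(t\) and keep the best one. The sketch below makes several assumptions that you should adapt: a higher score means higher predicted default risk, the bank lends only when the score is below \(t\), a denied application contributes zero profit, and rates are fractions. The function names and the toy arrays are hypothetical.

```python
import numpy as np

def realized_profit(loan_amnt, rate, defaulted):
    """Per-loan profit under the two modeling assumptions (rate as a fraction)."""
    repaid = loan_amnt * (1 + 0.25 * rate) ** 10 - loan_amnt
    default = loan_amnt * (1 + 0.25 * rate) ** 3 - 1.7 * loan_amnt
    return np.where(defaulted == 1, default, repaid)

def profit_per_borrower(scores, t, loan_amnt, rate, defaulted):
    """Average profit over all applicants when lending only if score < t."""
    approved = scores < t
    gains = realized_profit(loan_amnt, rate, defaulted)
    # Denied applicants contribute zero profit.
    return np.sum(gains * approved) / len(scores)

def best_threshold(scores, loan_amnt, rate, defaulted, num=101):
    """Grid search over thresholds; returns (best t, profit per borrower at that t)."""
    candidates = np.linspace(scores.min(), scores.max() + 1e-9, num)
    profits = np.array([
        profit_per_borrower(scores, t, loan_amnt, rate, defaulted)
        for t in candidates
    ])
    i = profits.argmax()
    return candidates[i], profits[i]

# Toy example with four applicants (made-up values).
scores = np.array([0.1, 0.2, 0.6, 0.9])
defaulted = np.array([0, 0, 1, 1])
loan_amnt = np.full(4, 10_000.0)
rate = np.full(4, 0.10)

t_star, profit_star = best_threshold(scores, loan_amnt, rate, defaulted)
print(f"best threshold: {t_star:.3f}, profit per borrower: {profit_star:.2f}")
```

Plotting the profit at each candidate threshold, with labeled axes, is a natural visualization to accompany this search.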

Part E: Evaluate Your Model from the Bank’s Perspective

Only after you have finalized your weight vector \(\mathbf{w}\) and threshold \(t\), evaluate your automated decision system on the test set:

url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/test.csv"
df_test = pd.read_csv(url)

What is the expected profit per borrower on the test set? Is it similar to your profit on the training set?

Part F: Evaluate Your Model From the Borrower’s Perspective

Now evaluate your model from the (aggregate) perspective of the prospective borrowers. Please quantitatively address the following questions, using the predictions of your model on the test data:

Is it more difficult for people in certain age groups to access credit under your proposed system?
Is it more difficult for people to get loans in order to pay for medical expenses? How does this compare with the actual rate of default in that group? What about people seeking loans for business ventures or education?
How does a person’s income level impact the ease with which they can access credit under your decision system?
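These questions can be addressed by comparing approval rates and actual default rates across groups. The sketch below is one possible approach on made-up data. It assumes you have already added a hypothetical approved column (1 if your system would lend, 0 otherwise) to the test data, and the age bin edges are arbitrary choices.

```python
import pandas as pd

def access_by_group(df, group_col, approved_col="approved"):
    """Approval rate, actual default rate, and group size for each group."""
    return df.groupby(group_col, observed=True).agg(
        approval_rate=(approved_col, "mean"),
        default_rate=("loan_status", "mean"),
        n=(approved_col, "size"),
    )

# Made-up stand-in for df_test with model decisions attached.
toy = pd.DataFrame({
    "loan_intent": ["MEDICAL"] * 4 + ["EDUCATION"] * 4,
    "person_age": [22, 35, 50, 60, 23, 25, 28, 40],
    "approved": [1, 0, 0, 0, 1, 1, 1, 0],
    "loan_status": [0, 1, 1, 0, 0, 0, 0, 1],
})

# Bin ages into groups; the bin edges here are an arbitrary choice.
toy["age_group"] = pd.cut(
    toy["person_age"], bins=[18, 30, 45, 120], labels=["18-30", "31-45", "46+"]
)

by_intent = access_by_group(toy, "loan_intent")
by_age = access_by_group(toy, "age_group")
print(by_intent)
print(by_age)
```

For income, pd.qcut can split person_income into quantile-based groups before applying the same helper.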

Part G: Write and Reflect

Write a brief introductory paragraph for your blog post describing the overall purpose, methodology, and findings of your study. Then, write a concluding discussion describing what you found and what you learned from this blog post.

Please include one paragraph discussing the following questions:

Considering that people seeking loans for medical expenses have high rates of default, is it fair that it is more difficult for them to obtain access to credit?

You are free to define “fairness” in a way that makes sense to you, but please write down your definition as part of your discussion.