DS 6030 | Fall 2024 | University of Virginia

Homework #6: SVM and Calibration

Author

Your Name Here

Published

October 4, 2024

Required R packages and Directories

dir_data= 'https://mdporter.github.io/teaching/data/' # data directory
library(tidyverse)  # functions for data manipulation

COMPAS Recidivism Prediction

A recidivism risk model called COMPAS was the topic of a ProPublica article on ML bias. Because the data and notebooks used for article was released on github, we can also evaluate the prediction bias (i.e., calibration).

This code will read in the violent crime risk score and apply the filtering used in the analysis.

Code

library(tidyverse)
df = read_csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv")

risk = df %>% 
  filter(days_b_screening_arrest <= 30) %>%
  filter(days_b_screening_arrest >= -30) %>% 
  filter(is_recid != -1) %>%
  filter(c_charge_degree != "O") %>%
  filter(v_score_text != 'N/A') %>% 
  transmute(
    age, age_cat,
    charge = ifelse(c_charge_degree == "F", "Felony", "Misdemeanor"),
    race,
    sex,                 
    priors_count = priors_count...15,
    score = v_decile_score,              # the risk score {1,2,...,10}
    outcome = two_year_recid...53        # outcome {1 = two year recidivate}
  )

The risk data frame has the relevant information for completing the problems.

Problem 1: COMPAS risk score

a. Risk Score and Probability (table)

Assess the predictive bias in the COMPAS risk scores by evaluating the probability of recidivism, e.g. estimate \(\Pr(Y = 1 \mid \text{Score}=x)\). Use any reasonable techniques (including Bayesian) to estimate the probability of recidivism for each risk score.

Specifically, create a table (e.g., data frame) that provides the following information:

The COMPASS risk score.
The point estimate of the probability of recidivism for each risk score.
95% confidence or credible intervals for the probability (e.g., Using normal theory, bootstrap, or Bayesian techniques).

Indicate the choices you made in estimation (e.g., state the prior if you used Bayesian methods).

Solution

Add solution here

b. Risk Score and Probability (plot)

Make a plot of the risk scores and corresponding estimated probability of recidivism.

Put the risk score on the x-axis and estimate probability of recidivism on y-axis.
Add the 95% confidence or credible intervals calculated in part a.
Comment on the patterns you see.

Solution

Add solution here

c. Risk Score and Probability (by race)

Repeat the analysis, but this time do so for every race. Produce a set of plots (one per race) and comment on the patterns.

Solution

Add solution here

d. ROC Curves

Use the raw COMPAS risk scores to make a ROC curve for each race.

Are the best discriminating models the ones you expected?
Are the ROC curves helpful in evaluating the COMPAS risk score?

Solution

Add solution here

Problem 2: Support Vector Machines (SVM)

Focus on Problem 1, we won’t have an SVM problem this week.