SYS 6018 | Spring 2024 | University of Virginia

Homework #4: Probability and Classification

Author

Your Name Here

Published

February 17, 2024

This is an independent assignment. Do not discuss or work with classmates.

Required R packages and Directories

dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(glmnet)    # for glmnet() functions
library(yardstick) # for evaluation metrics
library(tidyverse) # functions for data manipulation

Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding if two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

  • spatial is the spatial distance between the crimes
  • temporal is the fractional time (in days) between the crimes
  • tod and dow are the differences in time of day and day of week between the crimes
  • LOC, POA, and MOA are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
  • TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
  • The response variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).

These problems use the linkage-train and linkage-test datasets, available from the data directory defined above.

Load Crime Linkage Data

Solution

Add solution here.
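A minimal loading sketch, assuming the files are named linkage_train.csv and linkage_test.csv in dir_data (assumed names; substitute the actual file names for the datasets):

# Assumed file names; substitute the actual dataset file names if they differ
linkage_train = read_csv(paste0(dir_data, "linkage_train.csv"))
linkage_test  = read_csv(paste0(dir_data, "linkage_test.csv"))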

Problem 1: Penalized Regression for Crime Linkage

a. Fit a penalized linear regression model to predict linkage.

Use an elastic net penalty; the mixing parameter \(\alpha\) is your choice (the lasso and ridge special cases are allowed).

  • Report the value of \(\alpha \in [0, 1]\) used
  • Report the value of \(\lambda\) used
  • Report the estimated coefficients
Solution

Add solution here.
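A starting-point sketch, not a complete solution: it assumes linkage_train from the loading sketch above, with the response in column y, and uses cv.glmnet() to select \(\lambda\) for a chosen \(\alpha\).

X = linkage_train %>% select(-y) %>% as.matrix() # evidence variables as a matrix
y = linkage_train$y                              # response (1 = linked, 0 = unlinked)

set.seed(2024)                 # arbitrary seed for reproducible CV folds
alpha = 0.5                    # example mixing value; any alpha in [0, 1] works
fit_lm = cv.glmnet(X, y, alpha = alpha, family = "gaussian", nfolds = 10)

fit_lm$lambda.min              # lambda selected by 10-fold CV
coef(fit_lm, s = "lambda.min") # estimated coefficients at that lambda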

b. Fit a penalized logistic regression model to predict linkage.

Use an elastic net penalty; the mixing parameter \(\alpha\) is your choice (the lasso and ridge special cases are allowed).

  • Report the value of \(\alpha \in [0, 1]\) used
  • Report the value of \(\lambda\) used
  • Report the estimated coefficients
Solution

Add solution here.
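The scaffold from part a carries over; only the model family changes (again assuming X, y, and alpha from the sketch above):

fit_logit = cv.glmnet(X, y, alpha = alpha, family = "binomial", nfolds = 10)

fit_logit$lambda.min              # lambda selected by 10-fold CV
coef(fit_logit, s = "lambda.min") # estimated coefficients at that lambda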

c. ROC curve: training data

Produce one plot that shows the ROC curves, computed on the training data, for both models (from parts a and b). Use color and/or linetype to distinguish between the models and include a legend.

Solution

Add solution here.
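A sketch using yardstick, assuming fit_lm, fit_logit, X, and y from the earlier sketches. The linear model produces raw scores rather than probabilities, which is fine for an ROC curve since only the ranking matters.

# Training-set predictions from both models
p_lm    = predict(fit_lm,    X, s = "lambda.min")[, 1]
p_logit = predict(fit_logit, X, s = "lambda.min", type = "response")[, 1]

roc_data = bind_rows(
  tibble(model = "linear",   truth = factor(y, levels = c(1, 0)), score = p_lm),
  tibble(model = "logistic", truth = factor(y, levels = c(1, 0)), score = p_logit)
)

roc_data %>%
  group_by(model) %>%
  roc_curve(truth, score) %>% # first factor level (1) is treated as the event
  autoplot()                  # groups are colored automatically, giving a legend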

d. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression model using repeated hold-out data. The following steps will guide you:

  • Fix \(\alpha=.75\)
  • Run the following steps 25 times:
    1. Hold out 500 observations
    2. Use the remaining observations to estimate \(\lambda\) using 10-fold CV
    3. Predict the probability of linkage for the 500 hold-out observations
    4. Store the predictions and hold-out labels
  • Combine the results and produce a single hold-out-based ROC curve from all of the hold-out data. I’m looking for one ROC curve using the predictions for all 12,500 (25 × 500) observations rather than 25 separate curves.
  • Note: by estimating \(\lambda\) each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.
Solution

Add solution here.
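A sketch of the resampling loop, assuming X and y from the part-a sketch:

set.seed(2024) # arbitrary seed for reproducibility
results = map_dfr(1:25, function(i) {
  hold = sample(nrow(X), size = 500)     # 1. hold out 500 observations
  fit = cv.glmnet(X[-hold, ], y[-hold],  # 2. estimate lambda with 10-fold CV
                  alpha = 0.75, family = "binomial", nfolds = 10)
  tibble(                                # 3.-4. store predictions and labels
    rep   = i,
    truth = factor(y[hold], levels = c(1, 0)),
    p     = predict(fit, X[hold, ], s = "lambda.min", type = "response")[, 1]
  )
})

# One ROC curve over all 25 x 500 = 12,500 hold-out predictions
results %>% roc_curve(truth, p) %>% autoplot()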

e. Contest Part 1: Predict the estimated probability of linkage.

Predict the estimated probability of linkage for the test data (using any model).

  • Submit a .csv file (comma-separated format) named lastname_firstname_1.csv that includes the column named p containing your estimated posterior probability. We will use automated evaluation, so the format must be exact.
  • You are free to use any model (even ones we haven’t yet covered in the course).
  • You are free to use any data transformation or feature engineering.
  • You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
  • Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^{M} [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
Solution

Add solution here.
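One possible baseline, assuming the part-b model and linkage_test from the loading sketch (and that the test columns match the training evidence variables):

X_test = as.matrix(linkage_test)
p_test = predict(fit_logit, X_test, s = "lambda.min", type = "response")[, 1]

tibble(p = p_test) %>%
  write_csv("lastname_firstname_1.csv") # substitute your actual name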

f. Contest Part 2: Predict the linkage label.

Predict the linkages for the test data (using any model).

  • Submit a .csv file (comma-separated format) named lastname_firstname_2.csv that includes the column named linkage that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
  • You are free to use any model (even ones we haven’t yet covered in the course).
  • You are free to use any data transformation or feature engineering.
  • Your labels will be evaluated based on total cost, where \(\text{cost} = 1 \cdot FP + 8 \cdot FN\). This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
  • You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will receive bonus credit for at most one of the two contests.
Solution

Add solution here.
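Under the stated costs, predicting linkage whenever the expected cost of a false negative, \(8 \hat{p}\), exceeds the expected cost of a false positive, \(1 - \hat{p}\), gives the threshold \(p^* = 1/(1+8) = 1/9\). A minimal sketch, assuming p_test from the part-e sketch:

threshold = 1 / (1 + 8) # cost-optimal threshold for 1*FP + 8*FN
tibble(linkage = as.integer(p_test >= threshold)) %>%
  write_csv("lastname_firstname_2.csv") # substitute your actual name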