DS 6030 | Fall 2024 | University of Virginia

Homework #5: Probability and Classification

Author

Your Name Here

Published

September 30, 2024

Required R packages and Directories

dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
library(glmnet)    # penalized regression (lasso, ridge, elastic net)
library(tidyverse) # functions for data manipulation

Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding whether two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

  • spatial is the spatial distance between the crimes
  • temporal is the fractional time (in days) between the crimes
  • tod and dow are the differences in time of day and day of week between the crimes
  • LOC, POA, and MOA are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
  • TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
  • The response variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).

These problems use the linkage-train and linkage-test datasets (click on links for data).

Load Crime Linkage Data

Solution

Add solution here
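
As a starting point, a minimal loading sketch. The exact file names linkage_train.csv and linkage_test.csv are assumptions; match them to the links above.

linkage_train = read_csv(paste0(dir_data, "linkage_train.csv")) # assumed file name
linkage_test  = read_csv(paste0(dir_data, "linkage_test.csv"))  # assumed file name
glimpse(linkage_train) # 8 evidence variables + response (assumed column name: y)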

Problem 1: Penalized Regression for Crime Linkage

a. Fit a penalized linear regression model to predict linkage.

Use an elastic net penalty (lasso and ridge are special cases); the mixing parameter \(\alpha\) is your choice.

  • Report the value of \(\alpha \in [0, 1]\) used.
  • Report the value of \(\lambda\) used.
  • Report the estimated coefficients.
Solution

Add solution here
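
A minimal sketch using cv.glmnet. The value \(\alpha = 0.5\) is an arbitrary placeholder, and linkage_train with response column y comes from the loading sketch above.

X_train = linkage_train %>% select(-y) %>% as.matrix() # 8 evidence variables
y_train = linkage_train$y
set.seed(2024)
fit_linear = cv.glmnet(X_train, y_train, alpha = 0.5, family = "gaussian") # elastic net, linear regression
fit_linear$lambda.min              # lambda selected by 10-fold CV (the default)
coef(fit_linear, s = "lambda.min") # estimated coefficients at that lambda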

b. Fit a penalized logistic regression model to predict linkage.

Use an elastic net penalty (lasso and ridge are special cases); the mixing parameter \(\alpha\) is your choice.

  • Report the value of \(\alpha \in [0, 1]\) used.
  • Report the value of \(\lambda\) used.
  • Report the estimated coefficients.
Solution

Add solution here
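
The same sketch carries over with a binomial family (again, \(\alpha = 0.5\) is an arbitrary placeholder):

set.seed(2024)
fit_logreg = cv.glmnet(X_train, y_train, alpha = 0.5, family = "binomial") # penalized logistic regression
fit_logreg$lambda.min              # lambda selected by 10-fold CV
coef(fit_logreg, s = "lambda.min") # coefficients on the log-odds scale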

Problem 2: Random Forest for Crime Linkage

Fit a random forest model to predict crime linkage.

  • Report the loss function (or splitting rule) used.
  • Report any non-default tuning parameters.
  • Report the variable importance (indicate which importance method was used).
Solution

Add solution here
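
One option (an assumption, since the assignment does not name a package) is ranger; fit as a probability forest, it uses the Gini splitting rule by default.

library(ranger) # fast random forest implementation
set.seed(2024)
fit_rf = ranger(y ~ ., data = linkage_train %>% mutate(y = factor(y)),
                probability = TRUE,      # probability forest (Gini splitting rule)
                num.trees = 1000,        # non-default number of trees
                importance = "impurity") # Gini impurity variable importance
fit_rf$variable.importance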

Problem 3: ROC Curves

a. ROC curve: training data

Produce one plot that has the ROC curves, using the training data, for all three models (linear, logistic, and random forest). Use color and/or linetype to distinguish between models and include a legend.
Also report the AUC (area under the ROC curve) for each model. Again, use the training data.

  • Note: be wary of evaluating predictive performance on the same data used to estimate the tuning and model parameters. The next problem will walk you through a more proper way of evaluating predictive performance with resampling.
Solution

Add solution here
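
One way to build the curves without extra packages is to compute TPR/FPR directly. Here p_linear, p_logreg, and p_rf are assumed vectors of training-set probabilities from the three fitted models above.

roc_data = function(p, y) { # TPR/FPR at every threshold
  tibble(p = p, y = y) %>%
    arrange(desc(p)) %>%
    mutate(TPR = cumsum(y) / sum(y),          # true positive rate
           FPR = cumsum(1 - y) / sum(1 - y))  # false positive rate
}
bind_rows(linear   = roc_data(p_linear, linkage_train$y),
          logistic = roc_data(p_logreg, linkage_train$y),
          rf       = roc_data(p_rf, linkage_train$y),
          .id = "model") %>%
  ggplot(aes(FPR, TPR, color = model)) +
  geom_line() + geom_abline(linetype = 3) # legend comes from the color aesthetic

auc = function(p, y) { # AUC via the rank-sum (Mann-Whitney) identity
  (mean(rank(p)[y == 1]) - (sum(y) + 1) / 2) / sum(y == 0)
}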

b. ROC curve: resampling estimate

Recreate the ROC curve from the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out data. The following steps will guide you:

  • For logreg, use \(\alpha=.75\). For rf use mtry = 2, num.trees = 1000, and fix any other tuning parameters at your choice.
  • Run the following steps 25 times:
    1. Hold out 500 observations.
    2. Use the remaining observations to estimate \(\lambda\) using 10-fold CV for the logreg model. Don’t tune any rf parameters.
    3. Predict the probability of linkage for the 500 hold-out observations.
    4. Store the predictions and hold-out labels.
    5. Calculate the AUC.
  • Report the mean AUC and standard error for both models. Compare to the results from part a.
  • Produce two plots showing the 25 ROC curves for each model.
  • Note: by estimating \(\lambda\) each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.
Solution

Add solution here
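
A sketch of the loop, reusing the auc() helper from part a; seeds and variable names are arbitrary choices.

n = nrow(linkage_train)
results = map_dfr(1:25, function(i) {
  set.seed(100 + i)
  hold = sample(n, 500)                       # 1. hold out 500 observations
  train = linkage_train[-hold, ]
  test  = linkage_train[hold, ]
  X = train %>% select(-y) %>% as.matrix()
  fit_lr = cv.glmnet(X, train$y, alpha = .75, # 2. estimate lambda by 10-fold CV
                     family = "binomial", nfolds = 10)
  fit_rf = ranger(y ~ ., data = train %>% mutate(y = factor(y)),
                  mtry = 2, num.trees = 1000, probability = TRUE)
  Xtest = test %>% select(-y) %>% as.matrix()
  p_lr = predict(fit_lr, Xtest, s = "lambda.min", type = "response")[, 1] # 3. predict hold-outs
  p_rf = predict(fit_rf, data = test)$predictions[, "1"]
  tibble(iter = i,                            # 4.-5. store and score
         auc_logreg = auc(p_lr, test$y),
         auc_rf     = auc(p_rf, test$y))
})
results %>% summarize(across(starts_with("auc"),
                             list(mean = mean, se = ~ sd(.x) / sqrt(n()))))

For the ROC plots, also keep each iteration's hold-out predictions and labels (e.g., return them from the loop) and draw one roc_data() curve per iteration with group = iter.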

Problem 4: Contest

a. Contest Part 1: Predict the estimated probability of linkage.

Predict the estimated probability of linkage for the test data (using any model).

  • Submit a .csv file (ensure comma separated format) named lastname_firstname_1.csv that includes the column named p that is your estimated posterior probability. We will use automated evaluation, so the format must be exact.
  • You are free to use any model (even ones we haven’t yet covered in the course).
  • You are free to use any data transformation or feature engineering.
  • You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
  • Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^M [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
Solution

Add solution here
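
A sketch of the submission format, plus a log-loss check that matches the metric above; p_test is an assumed vector of test-set probabilities from whatever model you settle on.

tibble(p = p_test) %>% write_csv("lastname_firstname_1.csv") # single column named p

log_loss = function(p, y, eps = 1e-15) { # average log-loss, as defined above
  p = pmin(pmax(p, eps), 1 - eps)        # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}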

b. Contest Part 2: Predict the linkage label.

Predict the linkages for the test data (using any model).

  • Submit a .csv file (ensure comma separated format) named lastname_firstname_2.csv that includes the column named linkage that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
  • You are free to use any model (even ones we haven’t yet covered in the course).
  • You are free to use any data transformation or feature engineering.
  • Your labels will be evaluated based on total cost, where cost is equal to 1*FP + 8*FN. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
  • You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests.
Solution

Add solution here