Homework #5: Probability and Classification
DS 6030 | Fall 2024 | University of Virginia

Required R packages and Directories

```r
library(tidyverse) # functions for data manipulation
library(glmnet)    # penalized regression
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
```
Crime Linkage
Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding if two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:
- spatial is the spatial distance between the crimes
- temporal is the fractional time (in days) between the crimes
- tod and dow are the differences in time of day and day of week between the crimes
- LOC, POA, and MOA are binary, with a 1 corresponding to a match (type of property, point of entry, method of entry)
- TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime)
- The response variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).
These problems use the linkage-train and linkage-test datasets (click on links for data).
Load Crime Linkage Data
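The data can be read directly from the course data directory. This is a sketch: the file names linkage_train.csv and linkage_test.csv are assumptions; use the actual links given in the assignment.

```r
library(tidyverse)

dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
# File names below are assumed; substitute the links from the assignment page
linkage_train = read_csv(str_c(dir_data, "linkage_train.csv"))
linkage_test  = read_csv(str_c(dir_data, "linkage_test.csv"))
glimpse(linkage_train)
```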
Problem 1: Penalized Regression for Crime Linkage
a. Fit a penalized linear regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter is your choice.
- Report the value of \(\alpha \in [0, 1]\) used.
- Report the value of \(\lambda\) used.
- Report the estimated coefficients.
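One way to carry out part (a) is with glmnet's cross-validation wrapper. This is a sketch, not the required solution: the response column name y, the seed, and the choice \(\alpha = 0.75\) are all assumptions you should adapt.

```r
library(tidyverse)
library(glmnet)

# Assumes linkage_train is loaded and the response column is named y (0/1)
X = linkage_train %>% select(-y) %>% as.matrix()
y = linkage_train$y

alpha = 0.75                     # elastic net mixing parameter (your choice)
set.seed(2024)                   # arbitrary seed for reproducibility
fit_lin = cv.glmnet(X, y, alpha = alpha, family = "gaussian", nfolds = 10)

alpha                            # report the alpha used
fit_lin$lambda.min               # report the lambda chosen by 10-fold CV
coef(fit_lin, s = "lambda.min")  # report the estimated coefficients
```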
b. Fit a penalized logistic regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter is your choice.
- Report the value of \(\alpha \in [0, 1]\) used.
- Report the value of \(\lambda\) used.
- Report the estimated coefficients.
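Part (b) differs only in the model family: pass family = "binomial" so glmnet fits a penalized logistic regression. As above, the column name y and the value of \(\alpha\) are assumptions.

```r
library(tidyverse)
library(glmnet)

# Assumes linkage_train is loaded and the response column is named y (0/1)
X = linkage_train %>% select(-y) %>% as.matrix()
y = linkage_train$y

alpha = 0.75                        # elastic net mixing parameter (your choice)
set.seed(2024)
fit_logreg = cv.glmnet(X, y, alpha = alpha, family = "binomial", nfolds = 10)

alpha                               # report the alpha used
fit_logreg$lambda.min               # report the lambda chosen by 10-fold CV
coef(fit_logreg, s = "lambda.min")  # report the estimated coefficients
```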
Problem 2: Random Forest for Crime Linkage
Fit a random forest model to predict crime linkage.
- Report the loss function (or splitting rule) used.
- Report any non-default tuning parameters.
- Report the variable importance (indicate which importance method was used).
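A probability forest from the ranger package is one way to fit this; the sketch below uses impurity-based importance, but permutation importance is equally valid. The response column name y is an assumption.

```r
library(tidyverse)
library(ranger)

# Assumes linkage_train is loaded; y is converted to a factor so ranger
# fits a classification (probability) forest
set.seed(2024)
fit_rf = ranger(
  y ~ ., data = mutate(linkage_train, y = factor(y)),
  probability = TRUE,        # probability forest
  importance  = "impurity"   # or "permutation"; report which you used
)

fit_rf$splitrule                                 # splitting rule used
sort(fit_rf$variable.importance, decreasing = TRUE)  # variable importance
```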
Problem 3: ROC Curves
a. ROC curve: training data
Produce one plot that has the ROC curves, using the training data, for all three models (linear, logistic, and random forest). Use color and/or linetype to distinguish between models and include a legend.
Also report the AUC (area under the ROC curve) for each model. Again, use the training data.
- Note: you should be wary of evaluating predictive performance on the same data used to estimate the tuning and model parameters. The next problem will walk you through a more appropriate way of evaluating predictive performance with resampling.
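The plot and AUCs can be produced with yardstick once training-data probabilities are in hand. In this sketch, p_lin, p_logreg, and p_rf are assumed vectors of predicted linkage probabilities from the three fitted models, and y is the 0/1 training response.

```r
library(tidyverse)
library(yardstick)

# Assumed inputs: y (0/1 labels); p_lin, p_logreg, p_rf (predicted probabilities)
truth = factor(y, levels = c(1, 0))  # make "1" (linked) the event level
preds = bind_rows(
  tibble(model = "linear",   truth = truth, p = p_lin),
  tibble(model = "logistic", truth = truth, p = p_logreg),
  tibble(model = "rf",       truth = truth, p = p_rf)
)

# One plot with all three ROC curves, colored by model (legend included)
preds %>% group_by(model) %>% roc_curve(truth, p) %>% autoplot()

# AUC for each model on the training data
preds %>% group_by(model) %>% roc_auc(truth, p)
```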
b. ROC curve: resampling estimate
Recreate the ROC curve from the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out data. The following steps will guide you:
- For logreg, use \(\alpha=.75\). For rf use mtry = 2, num.trees = 1000, and fix any other tuning parameters at your choice.
- Run the following steps 25 times:
- Hold out 500 observations.
- Use the remaining observations to estimate \(\lambda\) using 10-fold CV for the logreg model. Don’t tune any rf parameters.
- Predict the probability of linkage for the 500 hold-out observations.
- Store the predictions and hold-out labels.
- Calculate the AUC.
- Report the mean AUC and standard error for both models. Compare to the results from part a.
- Produce two plots showing the 25 ROC curves for each model.
- Note: by estimating \(\lambda\) each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.
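The steps above can be sketched as the loop below. This is one possible implementation, assuming linkage_train is loaded with a 0/1 response column named y; the seed is arbitrary.

```r
library(tidyverse)
library(glmnet)
library(ranger)
library(yardstick)

set.seed(2024)
n_reps = 25; n_holdout = 500
out = vector("list", n_reps)

for (m in seq_len(n_reps)) {
  idx   = sample(nrow(linkage_train), n_holdout)  # hold out 500 observations
  train = linkage_train[-idx, ]
  test  = linkage_train[ idx, ]

  # logreg: re-estimate lambda by 10-fold CV at alpha = .75 each repetition
  X_tr = as.matrix(select(train, -y))
  X_te = as.matrix(select(test, -y))
  fit_lr = cv.glmnet(X_tr, train$y, alpha = 0.75, family = "binomial", nfolds = 10)
  p_lr = predict(fit_lr, newx = X_te, s = "lambda.min", type = "response")[, 1]

  # rf: fixed tuning parameters, no tuning inside the loop
  fit_rf = ranger(y ~ ., data = mutate(train, y = factor(y)),
                  mtry = 2, num.trees = 1000, probability = TRUE)
  p_rf = predict(fit_rf, data = test)$predictions[, "1"]

  # store hold-out labels and predictions
  out[[m]] = tibble(iter = m, truth = factor(test$y, levels = c(1, 0)),
                    logreg = p_lr, rf = p_rf)
}
preds = bind_rows(out)

# AUC per iteration, then mean AUC and standard error per model
preds %>%
  pivot_longer(c(logreg, rf), names_to = "model", values_to = "p") %>%
  group_by(model, iter) %>%
  roc_auc(truth, p) %>%
  group_by(model) %>%
  summarize(mean_auc = mean(.estimate), se = sd(.estimate) / sqrt(n()))
```

The 25 per-model ROC curves can then be drawn by grouping the long-format predictions by iter and plotting roc_curve() output with one facet per model.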
Problem 4: Contest
a. Contest Part 1: Predict the estimated probability of linkage.
Predict the estimated probability of linkage for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_1.csv that includes a column named p containing your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^M [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
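The evaluation metric above is straightforward to compute yourself before submitting. A minimal base-R version (the clipping constant eps is a common numerical safeguard, not part of the assignment's definition):

```r
# Average log-loss (mean negative Bernoulli log-likelihood)
log_loss = function(y, p, eps = 1e-15) {
  p = pmin(pmax(p, eps), 1 - eps)  # clip so log(0) never occurs
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

round(log_loss(c(1, 0, 1), c(0.9, 0.2, 0.8)), 4)  # 0.1839
```

Note that confident wrong predictions (p near 0 or 1 on the wrong side) are penalized heavily, so avoid submitting exact 0s or 1s.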
b. Contest Part 2: Predict the linkage label.
Predict the linkages for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_2.csv that includes a column named linkage that takes the value 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to 1*FP + 8*FN. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you only will get bonus credit for one of the two contests.
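Because the cost function weights a false negative 8 times a false positive, the label threshold that minimizes expected cost is not 0.5: labeling a pair as linked costs \((1-p)\cdot 1\) in expectation, labeling it unlinked costs \(p \cdot 8\), so predict linkage whenever \(p \ge 1/(1+8) = 1/9\). A base-R sketch (p_hat is a hypothetical vector of predicted probabilities):

```r
# Expected cost of labeling 1 is (1 - p) * c_FP; of labeling 0 is p * c_FN.
# Label 1 whenever p * c_FN >= (1 - p) * c_FP, i.e. p >= c_FP / (c_FP + c_FN).
c_FP = 1; c_FN = 8
threshold = c_FP / (c_FP + c_FN)   # 1/9, about 0.111

p_hat = c(0.05, 0.12, 0.50)        # hypothetical predicted probabilities
linkage = as.integer(p_hat >= threshold)
linkage                            # 0 1 1
```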