SYS 6018 | Spring 2024 | University of Virginia
Homework #4: Probability and Classification
This is an independent assignment. Do not discuss or work with classmates.
Required R packages and Directories
```r
dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(glmnet)     # for glmnet() functions
library(yardstick)  # for evaluation metrics
library(tidyverse)  # functions for data manipulation
```
Crime Linkage
Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding if two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:
- spatial is the spatial distance between the crimes
- temporal is the fractional time (in days) between the crimes
- tod and dow are the differences in time of day and day of week between the crimes
- LOC, POA, and MOA are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
- TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
- The response variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).
These problems use the linkage-train and linkage-test datasets (click on links for data).
Load Crime Linkage Data
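As a starting point, a minimal loading sketch; the file names linkage_train.csv and linkage_test.csv are assumptions, so use whatever files the links above point to:
```r
# load the crime linkage data from dir_data (file names below are assumed)
link_train = read_csv(paste0(dir_data, "linkage_train.csv"))
link_test  = read_csv(paste0(dir_data, "linkage_test.csv"))
```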
Problem 1: Penalized Regression for Crime Linkage
a. Fit a penalized linear regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter \(\alpha\) is your choice (see the sketch after this list).
- Report the value of \(\alpha \in [0, 1]\) used
- Report the value of \(\lambda\) used
- Report the estimated coefficients
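For illustration, a minimal sketch using cv.glmnet(); the object link_train, the response column name y, and \(\alpha = 0.5\) are assumptions, not required choices:
```r
# predictors as a numeric matrix and response vector (response column name y is assumed)
X = link_train %>% select(-y) %>% as.matrix()
y = link_train$y

set.seed(2024)                                        # reproducible CV folds
fit_lin = cv.glmnet(X, y, alpha = 0.5,                # alpha = 0.5 is an arbitrary example
                    family = "gaussian", nfolds = 10)
fit_lin$lambda.min                                    # selected lambda
coef(fit_lin, s = "lambda.min")                       # estimated coefficients
```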
b. Fit a penalized logistic regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter \(\alpha\) is your choice (see the sketch after this list).
- Report the value of \(\alpha \in [0, 1]\) used
- Report the value of \(\lambda\) used
- Report the estimated coefficients
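A parallel sketch for the logistic model, under the same assumptions (X, y, and an illustrative \(\alpha\)):
```r
set.seed(2024)
fit_log = cv.glmnet(X, y, alpha = 0.5,                # alpha = 0.5 is an arbitrary example
                    family = "binomial", nfolds = 10)
fit_log$lambda.min                                    # selected lambda
coef(fit_log, s = "lambda.min")                       # estimated coefficients
```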
c. ROC curve: training data
Produce one plot that has the ROC curves, using the training data, for both models (from part a and b). Use color and/or linetype to distinguish between models and include a legend.
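One way to do this with yardstick, assuming fit_lin and fit_log from the sketches above; since the event of interest is \(y=1\) and the factor levels are 0, 1, event_level = "second" is used:
```r
# in-sample scores/probabilities for both models
p_lin = predict(fit_lin, newx = X, s = "lambda.min")[, 1]
p_log = predict(fit_log, newx = X, s = "lambda.min", type = "response")[, 1]

bind_rows(
  tibble(model = "linear",   truth = factor(y, levels = c(0, 1)), p = p_lin),
  tibble(model = "logistic", truth = factor(y, levels = c(0, 1)), p = p_log)
) %>%
  group_by(model) %>%
  roc_curve(truth, p, event_level = "second") %>%   # treat y = 1 as the event
  autoplot()                                        # one plot, colored by model
```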
d. ROC curve: resampling estimate
Recreate the ROC curve from the penalized logistic regression model using repeated hold-out data. The following steps will guide you:
- Fix \(\alpha=.75\)
- Run the following steps 25 times:
- Hold out 500 observations
- Use the remaining observations to estimate \(\lambda\) using 10-fold CV
- Predict the probability of linkage for the 500 hold-out observations
- Store the predictions and hold-out labels
- Combine the results and produce the hold-out based ROC curve from all of the hold-out data. I’m looking for a single ROC curve using the predictions for all 12,500 (25 x 500) observations rather than 25 different curves.
- Note: by estimating \(\lambda\) each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.
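A sketch of the repeated hold-out loop, again assuming X and y as above:
```r
set.seed(2024)
holdout = map_dfr(1:25, function(i) {
  hold = sample(nrow(X), size = 500)                  # 500 hold-out observations
  fit  = cv.glmnet(X[-hold, ], y[-hold], alpha = 0.75,
                   family = "binomial", nfolds = 10)  # estimate lambda with 10-fold CV
  tibble(truth = factor(y[hold], levels = c(0, 1)),   # hold-out labels
         p = predict(fit, newx = X[hold, ], s = "lambda.min",
                     type = "response")[, 1])         # hold-out predictions
})

# one ROC curve from all 25 x 500 = 12,500 hold-out predictions
holdout %>%
  roc_curve(truth, p, event_level = "second") %>%
  autoplot()
```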
e. Contest Part 1: Predict the estimated probability of linkage.
Predict the estimated probability of linkage for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_1.csv that includes the column named p, your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven’t yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^{M} [y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
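For checking a candidate model (e.g., on hold-out data) and writing the submission file, a minimal sketch; it reuses fit_log and link_test from the earlier sketches and assumes link_test holds only the 8 evidence variables:
```r
# log loss: mean negative Bernoulli log-likelihood; eps guards against log(0)
log_loss = function(y, p, eps = 1e-15) {
  p = pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# test-set probabilities (any other model could be used instead)
p_test = predict(fit_log, newx = as.matrix(link_test),
                 s = "lambda.min", type = "response")[, 1]

tibble(p = p_test) %>% write_csv("lastname_firstname_1.csv")  # use your own name
```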
f. Contest Part 2: Predict the linkage label.
Predict the linkages for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_2.csv that includes the column named linkage, which takes the value 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven’t yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to 1*FP + 8*FN. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you will only get bonus credit for one of the two contests.
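Since a false negative costs 8 times a false positive, labeling a pair unlinked has expected cost \(8\hat{p}\) while labeling it linked has expected cost \(1-\hat{p}\), so predicting linkage whenever \(\hat{p} \ge 1/9\) minimizes expected cost. A sketch, reusing p_test from the previous part:
```r
threshold = 1 / (1 + 8)                          # cost-minimizing cutoff for 1*FP + 8*FN
tibble(linkage = as.integer(p_test >= threshold)) %>%
  write_csv("lastname_firstname_2.csv")          # use your own name
```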