Homework #5: Probability and Classification
DS 6030 | Fall 2024 | University of Virginia

Required R packages and Directories

```r
library(tidyverse) # functions for data manipulation
library(glmnet)    # penalized regression
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
```
Crime Linkage
Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding if two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:
- spatial is the spatial distance between the crimes
- temporal is the fractional time (in days) between the crimes
- tod and dow are the differences in time of day and day of week between the crimes
- LOC, POA, and MOA are binary, with a 1 corresponding to a match (type of property, point of entry, method of entry)
- TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime)
- The response variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).
These problems use the linkage-train and linkage-test datasets (click on links for data).
Load Crime Linkage Data
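The data can be read directly from the course data directory. This is a sketch: the file names linkage_train.csv and linkage_test.csv are assumptions; use the actual links given in the assignment.

```r
library(tidyverse)

dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
# File names below are assumed; substitute the links from the assignment page
linkage_train = read_csv(str_c(dir_data, "linkage_train.csv"))
linkage_test  = read_csv(str_c(dir_data, "linkage_test.csv"))
glimpse(linkage_train)
```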
Problem 1: Penalized Regression for Crime Linkage
a. Fit a penalized linear regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter is your choice.
- Report the value of \(\alpha \in [0, 1]\) used.
- Report the value of \(\lambda\) used.
- Report the estimated coefficients.
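One way to carry out part (a) is with glmnet's cross-validation wrapper. This is a sketch, not the required solution: the response column name y, the seed, and the choice \(\alpha = 0.75\) are all assumptions you should adapt.

```r
library(tidyverse)
library(glmnet)

# Assumes linkage_train is loaded and the response column is named y (0/1)
X = linkage_train %>% select(-y) %>% as.matrix()
y = linkage_train$y

alpha = 0.75                     # elastic net mixing parameter (your choice)
set.seed(2024)                   # arbitrary seed for reproducibility
fit_lin = cv.glmnet(X, y, alpha = alpha, family = "gaussian", nfolds = 10)

alpha                            # report the alpha used
fit_lin$lambda.min               # report the lambda chosen by 10-fold CV
coef(fit_lin, s = "lambda.min")  # report the estimated coefficients
```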
b. Fit a penalized logistic regression model to predict linkage.
Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing parameter is your choice.
- Report the value of \(\alpha \in [0, 1]\) used.
- Report the value of \(\lambda\) used.
- Report the estimated coefficients.
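Part (b) differs only in the model family: pass family = "binomial" so glmnet fits a penalized logistic regression. As above, the column name y and the value of \(\alpha\) are assumptions.

```r
library(tidyverse)
library(glmnet)

# Assumes linkage_train is loaded and the response column is named y (0/1)
X = linkage_train %>% select(-y) %>% as.matrix()
y = linkage_train$y

alpha = 0.75                        # elastic net mixing parameter (your choice)
set.seed(2024)
fit_logreg = cv.glmnet(X, y, alpha = alpha, family = "binomial", nfolds = 10)

alpha                               # report the alpha used
fit_logreg$lambda.min               # report the lambda chosen by 10-fold CV
coef(fit_logreg, s = "lambda.min")  # report the estimated coefficients
```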
Problem 2: Random Forest for Crime Linkage
Fit a random forest model to predict crime linkage.
- Report the loss function (or splitting rule) used.
- Report any non-default tuning parameters.
- Report the variable importance (indicate which importance method was used).
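A probability forest from the ranger package is one way to fit this; the sketch below uses impurity-based importance, but permutation importance is equally valid. The response column name y is an assumption.

```r
library(tidyverse)
library(ranger)

# Assumes linkage_train is loaded; y is converted to a factor so ranger
# fits a classification (probability) forest
set.seed(2024)
fit_rf = ranger(
  y ~ ., data = mutate(linkage_train, y = factor(y)),
  probability = TRUE,        # probability forest
  importance  = "impurity"   # or "permutation"; report which you used
)

fit_rf$splitrule                                 # splitting rule used
sort(fit_rf$variable.importance, decreasing = TRUE)  # variable importance
```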
Problem 3: ROC Curves
a. ROC curve: training data
Produce one plot that has the ROC curves, using the training data, for all three models (linear, logistic, and random forest). Use color and/or linetype to distinguish between models and include a legend.
Also report the AUC (area under the ROC curve) for each model. Again, use the training data.
- Note: you should be wary of evaluating predictive performance on the same data used to estimate the tuning and model parameters. The next problem will walk you through a more appropriate way of evaluating predictive performance with resampling.
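The plot and AUCs can be produced with yardstick once training-data probabilities are in hand. In this sketch, p_lin, p_logreg, and p_rf are assumed vectors of predicted linkage probabilities from the three fitted models, and y is the 0/1 training response.

```r
library(tidyverse)
library(yardstick)

# Assumed inputs: y (0/1 labels); p_lin, p_logreg, p_rf (predicted probabilities)
truth = factor(y, levels = c(1, 0))  # make "1" (linked) the event level
preds = bind_rows(
  tibble(model = "linear",   truth = truth, p = p_lin),
  tibble(model = "logistic", truth = truth, p = p_logreg),
  tibble(model = "rf",       truth = truth, p = p_rf)
)

# One plot with all three ROC curves, colored by model (legend included)
preds %>% group_by(model) %>% roc_curve(truth, p) %>% autoplot()

# AUC for each model on the training data
preds %>% group_by(model) %>% roc_auc(truth, p)
```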
b. ROC curve: resampling estimate
Recreate the ROC curve from the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out data. The following steps will guide you:
- For logreg, use \(\alpha=.75\). For rf use mtry = 2, num.trees = 1000, and fix any other tuning parameters at your choice.
- Run the following steps 25 times:
- Hold out 500 observations.
- Use the remaining observations to estimate \(\lambda\) using 10-fold CV for the logreg model. Don’t tune any rf parameters.
- Predict the probability of linkage for the 500 hold-out observations.
- Store the predictions and hold-out labels.
- Calculate the AUC.
- Report the mean AUC and standard error for both models. Compare to the results from part a.
- Produce two plots showing the 25 ROC curves for each model.
- Note: by estimating \(\lambda\) each iteration, we are incorporating the uncertainty present in estimating that tuning parameter.
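The steps above can be sketched as the loop below. This is one possible implementation, assuming linkage_train is loaded with a 0/1 response column named y; the seed is arbitrary.

```r
library(tidyverse)
library(glmnet)
library(ranger)
library(yardstick)

set.seed(2024)
n_reps = 25; n_holdout = 500
out = vector("list", n_reps)

for (m in seq_len(n_reps)) {
  idx   = sample(nrow(linkage_train), n_holdout)  # hold out 500 observations
  train = linkage_train[-idx, ]
  test  = linkage_train[ idx, ]

  # logreg: re-estimate lambda by 10-fold CV at alpha = .75 each repetition
  X_tr = as.matrix(select(train, -y))
  X_te = as.matrix(select(test, -y))
  fit_lr = cv.glmnet(X_tr, train$y, alpha = 0.75, family = "binomial", nfolds = 10)
  p_lr = predict(fit_lr, newx = X_te, s = "lambda.min", type = "response")[, 1]

  # rf: fixed tuning parameters, no tuning inside the loop
  fit_rf = ranger(y ~ ., data = mutate(train, y = factor(y)),
                  mtry = 2, num.trees = 1000, probability = TRUE)
  p_rf = predict(fit_rf, data = test)$predictions[, "1"]

  # store hold-out labels and predictions
  out[[m]] = tibble(iter = m, truth = factor(test$y, levels = c(1, 0)),
                    logreg = p_lr, rf = p_rf)
}
preds = bind_rows(out)

# AUC per iteration, then mean AUC and standard error per model
preds %>%
  pivot_longer(c(logreg, rf), names_to = "model", values_to = "p") %>%
  group_by(model, iter) %>%
  roc_auc(truth, p) %>%
  group_by(model) %>%
  summarize(mean_auc = mean(.estimate), se = sd(.estimate) / sqrt(n()))
```

The 25 per-model ROC curves can then be drawn by grouping the long-format predictions by iter and plotting roc_curve() output with one facet per model.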
Problem 4: Contest
a. Contest Part 1: Predict the estimated probability of linkage.
Predict the estimated probability of linkage for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_1.csv that includes a column named p containing your estimated posterior probability. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
- Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^M [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
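The evaluation metric above is straightforward to compute yourself before submitting. A minimal base-R version (the clipping constant eps is a common numerical safeguard, not part of the assignment's definition):

```r
# Average log-loss (mean negative Bernoulli log-likelihood)
log_loss = function(y, p, eps = 1e-15) {
  p = pmin(pmax(p, eps), 1 - eps)  # clip so log(0) never occurs
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

round(log_loss(c(1, 0, 1), c(0.9, 0.2, 0.8)), 4)  # 0.1839
```

Note that confident wrong predictions (p near 0 or 1 on the wrong side) are penalized heavily, so avoid submitting exact 0s or 1s.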
b. Contest Part 2: Predict the linkage label.
Predict the linkages for the test data (using any model).
- Submit a .csv file (ensure comma-separated format) named lastname_firstname_2.csv that includes a column named linkage that takes the value 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
- You are free to use any model (even ones we haven't yet covered in the course).
- You are free to use any data transformation or feature engineering.
- Your labels will be evaluated based on total cost, where cost is equal to 1*FP + 8*FN. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points. Note: you only will get bonus credit for one of the two contests.
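Because the cost function weights a false negative 8 times a false positive, the label threshold that minimizes expected cost is not 0.5: labeling a pair as linked costs \((1-p)\cdot 1\) in expectation, labeling it unlinked costs \(p \cdot 8\), so predict linkage whenever \(p \ge 1/(1+8) = 1/9\). A base-R sketch (p_hat is a hypothetical vector of predicted probabilities):

```r
# Expected cost of labeling 1 is (1 - p) * c_FP; of labeling 0 is p * c_FN.
# Label 1 whenever p * c_FN >= (1 - p) * c_FP, i.e. p >= c_FP / (c_FP + c_FN).
c_FP = 1; c_FN = 8
threshold = c_FP / (c_FP + c_FN)   # 1/9, about 0.111

p_hat = c(0.05, 0.12, 0.50)        # hypothetical predicted probabilities
linkage = as.integer(p_hat >= threshold)
linkage                            # 0 1 1
```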