DS 6030 | Spring 2026 | University of Virginia

Homework #4: Probability and Classification

Author

First Last (abc2de)

Published

Spring 2026

Crime Linkage

Crime linkage attempts to determine if a set of unsolved crimes share a common offender. Pairwise crime linkage is the simpler task of deciding whether two crimes share a common offender; it can be treated as a binary classification problem. The linkage training data has 8 evidence variables that measure the similarity between a pair of crimes:

  • spatial is the spatial distance between the crimes
  • temporal is the fractional time (in days) between the crimes
  • tod and dow are the differences in time of day and day of week between the crimes
  • LOC, POA, and MOA are binary with a 1 corresponding to a match (type of property, point of entry, method of entry)
  • TIMERANGE is the time between the earliest and latest possible times the crime could have occurred (because the victim was away from the house during the crime).
  • The outcome variable indicates if the crimes are linked (\(y=1\)) or unlinked (\(y=0\)).

These problems use the linkage-train and linkage-test datasets (click on links for data).

Load Crime Linkage Data

Solution

Load data here

Problem 1: Penalized Regression for Crime Linkage

a. Fit a penalized linear regression model to predict linkage.

Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing proportion is your choice.

  • Report the selected tuning parameters.
  • Report the estimated coefficients.
Solution

Add solution here
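A possible starting point (not a complete solution): the sketch below uses scikit-learn's `ElasticNetCV` on synthetic stand-in data, since the real linkage-train data is not loaded here. The course materials may instead use R (glmnet/tidymodels); `l1_ratio` is scikit-learn's analogue of the elastic-net mixture parameter.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 evidence variables and 0/1 linkage outcome.
rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(float)

X_std = StandardScaler().fit_transform(X)   # standardize before penalizing

# l1_ratio is the elastic-net mixing proportion (1 = lasso, 0 = ridge);
# the penalty strength alpha is tuned by 10-fold cross-validation.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10, random_state=0)
model.fit(X_std, y)

print("selected l1_ratio:", model.l1_ratio_)   # tuning parameter: mixture
print("selected alpha:", model.alpha_)         # tuning parameter: penalty
print("coefficients:", model.coef_)            # estimated coefficients
```

Standardizing the predictors matters here: the penalty treats all coefficients on the same scale, so unstandardized variables like spatial distance would be penalized inconsistently.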

b. Fit a penalized logistic regression model to predict linkage.

Use an elastic net penalty (which includes lasso and ridge as special cases); the mixing proportion is your choice.

  • Report the selected tuning parameters.
  • Report the estimated coefficients.
Solution

Add solution here
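A possible starting point on synthetic stand-in data, using scikit-learn's `LogisticRegressionCV` (the R/glmnet route is equally valid). Here both the penalty strength `C` and the mixing proportion `l1_ratio` are selected by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the linkage training data.
rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)

# Elastic-net logistic regression; "saga" is the solver that supports the
# elasticnet penalty. Both tuning parameters are chosen by 10-fold CV.
clf = LogisticRegressionCV(Cs=10, cv=10, penalty="elasticnet", solver="saga",
                           l1_ratios=[0.25, 0.5, 0.75], max_iter=5000,
                           random_state=0)
clf.fit(X_std, y)

print("selected C:", clf.C_[0])               # inverse penalty strength
print("selected l1_ratio:", clf.l1_ratio_[0])  # elastic-net mixture
print("coefficients:", clf.coef_.ravel())
```

Note that scikit-learn's `C` is the inverse of the glmnet-style penalty `lambda`, so a large selected `C` corresponds to light regularization.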

Problem 2: Random Forest for Crime Linkage

Fit a random forest model to predict crime linkage.

  • Report the loss function (or splitting rule) used.
  • Report any non-default tuning parameters.
  • Report variable importance (indicate which importance method was used).
Solution

Add solution here
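A minimal sketch on synthetic stand-in data, using scikit-learn's `RandomForestClassifier` (an R `ranger` fit would be the direct analogue). It illustrates the three reporting items: splitting rule, non-default tuning parameters, and variable importance with its method named.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the linkage training data.
rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

# criterion="gini" is the splitting rule; max_features (the analogue of
# mtry) is set to a non-default value here.
rf = RandomForestClassifier(n_estimators=500, criterion="gini",
                            max_features=2, random_state=0)
rf.fit(X, y)

# feature_importances_ is mean-decrease-in-impurity (MDI) importance;
# permutation importance (sklearn.inspection.permutation_importance) is a
# common alternative and should be named if used instead.
for j, imp in enumerate(rf.feature_importances_):
    print(f"x{j}: {imp:.3f}")
```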

Problem 3: ROC Curves

a. ROC curve: training data

Using the training data, produce a single plot showing the ROC curves for all three models: linear, logistic, and random forest. Distinguish models using color and/or line type, and include a legend.

For each model, report the AUC computed from the training data.

Note: evaluating predictive performance on the same data used to estimate model parameters and tune hyperparameters is generally optimistic. This part is for illustration only. In the next problem, you will use resampling to obtain a more appropriate estimate of out-of-sample predictive performance.

Solution

Add solution here
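A sketch of the computation on synthetic stand-in data: fit the three models, score the training data, and get the points of each ROC curve with `roc_curve` plus the AUC with `roc_auc_score`. The plotting itself (e.g. matplotlib, one color/line type per model plus a legend) is omitted here; the `fpr`/`tpr` arrays are what you would plot.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in for the linkage training data.
rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Training-set scores from each model. All three produce a score that can
# be thresholded, so each has an ROC curve even though the linear model's
# output is not a probability.
scores = {
    "linear": LinearRegression().fit(X, y).predict(X),
    "logistic": LogisticRegression(max_iter=1000)
                .fit(X, y).predict_proba(X)[:, 1],
    "rf": RandomForestClassifier(n_estimators=200, random_state=0)
          .fit(X, y).predict_proba(X)[:, 1],
}

train_auc = {}
for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y, s)       # points of one ROC curve (plot fpr vs tpr)
    train_auc[name] = roc_auc_score(y, s)
    print(f"{name}: training AUC = {train_auc[name]:.3f}")
```

The random forest's training AUC will typically be near 1 here, which is exactly the optimism the note above warns about.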

b. ROC curve: resampling estimate

Recreate the ROC curves for the penalized logistic regression (logreg) and random forest (rf) models using repeated hold-out validation. Follow the steps below.

  • Model setup
    • For logreg, fix mixture = 0.75 (close to the lasso) and tune the penalty parameter.
    • For rf, fix mtry = 2 and num.trees = 1000. Fix any remaining tuning parameters at values of your choice. You won’t tune anything for random forest.
  • Resampling procedure: Repeat the following steps 25 times:
    1. Randomly hold out 500 observations.
    2. Fit each model using the remaining observations.
      • For penalized logistic regression, select the regularization/penalty strength using 10-fold cross-validation within the training set.
      • Do not tune any random forest parameters.
    3. Predict the probability of linkage for the 500 held-out observations.
    4. Store the predicted probabilities and the true held-out labels.
    5. Compute the AUC for the held-out set.
  • Reporting and visualization
    • Report the mean AUC and standard error across the 25 repetitions for each model.
    • Compare these results to the training data AUCs from part a.
    • Produce two plots, one for logreg and one for rf, each showing the 25 ROC curves from the resampling procedure.
  • Note: because the penalty parameter is selected anew in each repetition, this procedure incorporates uncertainty from tuning the penalization parameter, in addition to uncertainty from the train/test split.
Solution

Add solution here
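The resampling loop above can be sketched as follows, again on synthetic stand-in data and in scikit-learn terms: `l1_ratio=0.75` stands in for mixture = 0.75, and `max_features=2` / `n_estimators=1000` stand in for mtry = 2 / num.trees = 1000. Only the hold-out AUCs are summarized here; the stored per-repetition `(fpr, tpr)` curves would drive the two ROC plots.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the linkage training data.
rng = np.random.default_rng(0)
n, p = 1000, 8
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

n_reps, n_hold = 25, 500
aucs = {"logreg": [], "rf": []}

for rep in range(n_reps):
    idx = rng.permutation(n)                 # step 1: random 500-obs hold-out
    hold, train = idx[:n_hold], idx[n_hold:]

    # Step 2: penalized logistic regression with the mixture fixed at 0.75;
    # the penalty strength (C) is chosen by 10-fold CV inside the training split.
    logreg = LogisticRegressionCV(Cs=5, cv=10, penalty="elasticnet",
                                  solver="saga", l1_ratios=[0.75],
                                  max_iter=5000).fit(X[train], y[train])

    # Step 2: random forest with fixed settings; nothing is tuned.
    rf = RandomForestClassifier(n_estimators=1000, max_features=2,
                                random_state=rep).fit(X[train], y[train])

    for name, m in [("logreg", logreg), ("rf", rf)]:
        p_hat = m.predict_proba(X[hold])[:, 1]            # steps 3-4
        aucs[name].append(roc_auc_score(y[hold], p_hat))  # step 5

for name, vals in aucs.items():
    vals = np.asarray(vals)
    print(f"{name}: mean AUC = {vals.mean():.3f}, "
          f"SE = {vals.std(ddof=1) / np.sqrt(n_reps):.4f}")
```

These hold-out AUCs are the numbers to compare against the (optimistic) training-data AUCs from part a.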

Problem 4: Contest

For these problems:

  • You are free to use any model (even ones we haven’t yet covered in the course).
  • You are free to use any data transformation or feature engineering.
  • You will receive credit for a proper submission; the top three scores from each section will receive an additional 0.5 bonus points. However, you cannot receive double credit if you are on the leaderboard for both contests.
  • We will use automated evaluation of the predictions, so the format specified in the problem must be exact. Take a look at your .csv file before uploading.

a. Contest Part 1: Predict the estimated probability of linkage.

Predict the estimated probability of linkage for the test data (using any model).

  • Submit a .csv file (ensure comma separated format) named lastname_firstname_1.csv that includes the column named p that is your estimated posterior probability. We will use automated evaluation, so the format must be exact.
  • Your probabilities will be evaluated with respect to the mean negative Bernoulli log-likelihood (known as the average log-loss metric): \[ L = - \frac{1}{M} \sum_{i=1}^{M} [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \] where \(M\) is the number of test observations, \(\hat{p}_i\) is the prediction for the \(i\)th test observation, and \(y_i \in \{0,1\}\) are the true test set labels.
Solution

Add solution here
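A quick sanity check of the evaluation metric on toy numbers: the formula above, computed by hand, matches `sklearn.metrics.log_loss`. Checking your own probabilities this way before submitting catches issues like predictions of exactly 0 or 1 (which make the log-loss infinite).

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities standing in for a submission's p column.
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.90, 0.20, 0.60, 0.80, 0.10])

# Mean negative Bernoulli log-likelihood, exactly as in the problem statement.
L = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(f"manual log-loss:  {L:.4f}")
print(f"sklearn log_loss: {log_loss(y, p_hat):.4f}")  # agrees with the manual value
```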

b. Contest Part 2: Predict the linkage label.

Predict the linkages for the test data (using any model).

  • Submit a .csv file (ensure comma separated format) named lastname_firstname_2.csv that includes the column named linkage that takes the value of 1 for linked pairs and 0 for unlinked pairs. We will use automated evaluation, so the format must be exact.
  • Your labels will be evaluated based on total cost, where cost is equal to 1*FP + 8*FN. This implies that False Negatives (FN) are 8 times as costly as False Positives (FP).
Solution

Add solution here
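One implication worth sketching (on toy data, not the actual test set): with costs 1·FP + 8·FN, the expected-cost-minimizing rule for a calibrated probability is to predict 1 whenever \(\hat{p} \ge c_{FP}/(c_{FP}+c_{FN}) = 1/9\), far below the default 0.5 cutoff.

```python
import numpy as np

# Toy labels and probabilities standing in for hold-out predictions.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
p_hat = np.clip(0.6 * y + rng.normal(0.2, 0.25, size=200), 0.01, 0.99)

def total_cost(y_true, y_pred, c_fp=1, c_fn=8):
    """Contest cost: 1 per false positive, 8 per false negative."""
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return c_fp * fp + c_fn * fn

# Bayes-optimal threshold under asymmetric costs: c_fp / (c_fp + c_fn) = 1/9.
t_star = 1 / 9
cost_default = total_cost(y, (p_hat >= 0.5).astype(int))
cost_tuned = total_cost(y, (p_hat >= t_star).astype(int))
print("total cost at threshold 0.5:", cost_default)
print("total cost at threshold 1/9:", cost_tuned)
```

In practice you would estimate the cost at candidate thresholds on held-out data rather than trusting calibration, but the 1/9 figure is the natural starting point.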