DS 6030 | Spring 2026 | University of Virginia
Homework #2: Model Selection and Performance Estimation
Set-up
In this homework you will study model tuning, predictive performance, and uncertainty quantification using elastic net regression. All models should be fit with elastic net regularization, tuning over the regularization/penalty parameter and, where feasible, the mixing parameter.
Data
The goal is to predict the year a song was released based on audio features. Each dataset contains an outcome variable (Y, the song release year) and a set of \(p=45\) numeric predictors (X1 to X45) derived from audio features.
Modeling requirements
All predictive models in this homework must be elastic net regression models. You may use any software package you prefer that implements elastic net regression, such as glmnet or tidymodels in R, or scikit-learn in Python.
Unless otherwise stated, performance should be evaluated using root mean squared error (RMSE).
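For predictions \(\hat{y}_i\) of outcomes \(y_i\) on \(n\) held-out observations,
\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.
\]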
Elastic net regression
Elastic net regression is a linear regression model with regularization that combines the ideas of ridge regression and the lasso. The model is fit by minimizing a loss function that balances prediction accuracy with a penalty on the size of the regression coefficients.
Specifically, elastic net minimizes the prediction error plus a weighted combination of an \(\ell_1\) penalty (lasso) and an \(\ell_2\) penalty (ridge).
Two tuning parameters control this balance. The mixture parameter controls the tradeoff between \(\ell_1\) (lasso) and \(\ell_2\) (ridge) penalties. The penalty parameter controls the strength of the overall penalty.
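For concreteness, one common parameterization of the elastic net objective (the form used, up to minor scaling conventions, by glmnet and scikit-learn) is
\[
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda\left(\alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2\right),
\]
where \(\lambda \ge 0\) is the penalty (overall strength) and \(\alpha \in [0, 1]\) is the mixture (\(\alpha = 1\) gives the lasso, \(\alpha = 0\) gives ridge regression).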
Computational guidance and fallback options
Some parts of this homework involve repeated or nested cross validation, which can be computationally demanding depending on your hardware. If you encounter computing or memory limitations, you may use one or more of the following simplifications. Clearly state any simplifications you use in your solution.
- Reduce the size of the training and validation data by random subsampling, while keeping the test set fixed.
- Reduce the cross validation complexity by using fewer folds or fewer repetitions.
- Fix mixture to 1 (lasso) and tune only penalty.
You should not modify the test dataset or reduce its size unless explicitly instructed.
Problem 1: Tuning using a single validation set
a. Tuning
Fit models on the training data and use the validation set to tune mixture and penalty to minimize RMSE. Clearly report the selected values of mixture and penalty, along with the corresponding validation RMSE.
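As a rough guide, a minimal sketch of this tuning loop in Python with scikit-learn is shown below. Here `X_train`, `y_train`, `X_val`, and `y_val` are hypothetical names for your loaded training and validation arrays, and the grids are illustrative only; scikit-learn calls the penalty `alpha` and the mixture `l1_ratio`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

best = {"rmse": np.inf}
for l1_ratio in [0.1, 0.5, 1.0]:                 # mixture grid (illustrative)
    for alpha in np.logspace(-3, 1, 20):         # penalty grid (illustrative)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
        if rmse < best["rmse"]:
            best = {"mixture": l1_ratio, "penalty": alpha, "rmse": rmse}

print(best)   # selected mixture/penalty and the corresponding validation RMSE
```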
b. Fit final model
Refit the model using the combined training and validation data, fixing mixture and penalty at the selected values from part (a).
c. Predict on a subset of the test data
Using the final fitted model, predict outcomes for the first 1000 observations in the test set. Compute and report the test RMSE for this subset.
d. 90% confidence interval via normal theory
Construct a 90% confidence interval for the RMSE in part (c) using a normal theory approximation.
Report the CI.
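One reasonable normal-theory construction (a sketch, not the only valid choice) builds the interval on the mean squared error and then takes square roots of the endpoints. Here `y_test_sub` and `y_pred_sub` are hypothetical arrays holding the outcomes and predictions from part (c).

```python
import numpy as np
from scipy import stats

sq_err = (y_test_sub - y_pred_sub) ** 2          # squared errors for the 1000 test rows
n = len(sq_err)
mse = sq_err.mean()
se = sq_err.std(ddof=1) / np.sqrt(n)             # standard error of the mean squared error
z = stats.norm.ppf(0.95)                         # two-sided 90% interval
lo, hi = max(mse - z * se, 0.0), mse + z * se    # guard against a negative lower endpoint
print(f"RMSE = {np.sqrt(mse):.3f}, 90% CI = ({np.sqrt(lo):.3f}, {np.sqrt(hi):.3f})")
```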
e. 90% confidence interval via bootstrap
Construct a 90% confidence interval for the RMSE in part (c) using a bootstrap procedure (e.g., https://openintro-ims.netlify.app/foundations-bootstrapping). Specify:
- the bootstrap type (e.g., percentile, bias-corrected, normal, studentized)
- the number of bootstrap resamples used
Report the CI.
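A minimal percentile-bootstrap sketch, resampling test observations only (the fitted model is not refit); `y_test_sub` and `y_pred_sub` are the same hypothetical arrays as above, and B = 2000 resamples is only a suggestion.

```python
import numpy as np

rng = np.random.default_rng(2026)
errors = y_test_sub - y_pred_sub
n, B = len(errors), 2000
boot_rmse = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)             # resample test rows with replacement
    boot_rmse[b] = np.sqrt(np.mean(errors[idx] ** 2))
ci = np.percentile(boot_rmse, [5, 95])           # 90% percentile interval
print(f"90% bootstrap CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```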
f. Full test set evaluation
Predict outcomes for the entire test set and compute the RMSE.
g. 90% confidence intervals for the full test set
Using the predictions from part (f), construct 90% confidence intervals for the RMSE using:
- normal theory
- the bootstrap method (use the same bootstrap approach as in part (e))
Report both CIs.
h. Visualization
Create a single graphic that communicates uncertainty in predictive performance. The graphic must show:
- the RMSE point estimates
- the 90% confidence intervals
for both test set sizes (partial test set from part (c) and full test set from part (f)) and for both confidence interval methods (normal theory and bootstrap).
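One possible layout, as a sketch: a point-and-error-bar plot with matplotlib. Here `results` is a hypothetical list of (label, RMSE, lower, upper) tuples that you would fill in from parts (c)-(g).

```python
import matplotlib.pyplot as plt
import numpy as np

# results = [("partial / normal", ..., ..., ...), ("partial / bootstrap", ..., ..., ...),
#            ("full / normal", ..., ..., ...),    ("full / bootstrap", ..., ..., ...)]
labels = [r[0] for r in results]
est = np.array([r[1] for r in results])
lo = np.array([r[2] for r in results])
hi = np.array([r[3] for r in results])

x = np.arange(len(results))
plt.errorbar(x, est, yerr=[est - lo, hi - est], fmt="o", capsize=4)
plt.xticks(x, labels, rotation=20)
plt.ylabel("Test RMSE")
plt.title("RMSE estimates with 90% confidence intervals")
plt.tight_layout()
plt.show()
```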
Problem 2: Tuning using repeated cross-validation
a. Tuning with repeated cross-validation
Combine the training and validation datasets. Using this combined dataset only, tune mixture and penalty via repeated Monte Carlo cross-validation to minimize RMSE. Clearly state:
- the number of hold-out (test) observations
- the number of repetitions
Report the selected mixture and penalty, along with the mean cross-validated RMSE.
Provide a brief justification for the chosen cross-validation configuration (e.g., bias–variance considerations, computational cost, dataset size).
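A minimal sketch of one way to set this up with scikit-learn, where `X` and `y` are hypothetical names for the combined training + validation data and `ShuffleSplit` provides the repeated random (Monte Carlo) hold-out splits; the hold-out fraction, repetition count, and grids are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, ShuffleSplit

cv = ShuffleSplit(n_splits=20, test_size=0.20, random_state=2026)   # 20 repetitions, 20% held out
grid = {"alpha": np.logspace(-3, 1, 20),     # penalty
        "l1_ratio": [0.1, 0.5, 1.0]}         # mixture
search = GridSearchCV(ElasticNet(max_iter=10_000), grid,
                      scoring="neg_root_mean_squared_error", cv=cv)
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # selected parameters and mean CV RMSE
```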
b. Estimating RMSE and uncertainty from cross-validation
Using the repeated cross-validation results from part (a), estimate the RMSE and construct a 90% confidence interval without using the test data. Clearly describe how the RMSE estimate and confidence interval are obtained from the cross-validation folds. Any reasonable approach to constructing the confidence interval is acceptable.
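One simple option, as a sketch: treat the per-repetition hold-out RMSEs as an approximate sample and form a normal-theory interval around their mean. Here `fold_rmse` is a hypothetical array of the RMSE from each repetition for the selected parameters; note that the repetitions share observations, so this interval tends to understate the true uncertainty.

```python
import numpy as np
from scipy import stats

r = len(fold_rmse)                               # number of repetitions
est = fold_rmse.mean()
se = fold_rmse.std(ddof=1) / np.sqrt(r)
z = stats.norm.ppf(0.95)                         # two-sided 90% interval
print(f"CV RMSE = {est:.3f}, 90% CI = ({est - z * se:.3f}, {est + z * se:.3f})")
```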
c. Final model fit
Fit the final model on the combined training and validation dataset using the selected mixture and penalty.
d. Test set evaluation
Using the fitted model, predict outcomes for the entire test set and compute the test RMSE.
e. Test set confidence interval
Construct a 90% confidence interval for the RMSE using any appropriate method (e.g., normal theory or bootstrap). Provide details on what method was used.
Report the CI.
f. Visualization
Create a single graphic that compares uncertainty estimates before and after observing the test data. Show:
- the RMSE estimate and 90% confidence interval from cross-validation
- the RMSE estimate and 90% confidence interval from the test set
g. Reflection
In a short paragraph, describe what you learned from this exercise. Discuss how the cross-validation based RMSE and confidence interval compared to the test set results, what surprised you (if anything), and which approach you would trust most for reporting predictive performance in practice. Briefly explain why.
Problem 3: Nested Cross-Validation
a. Nested cross-validation implementation
Combine the training and validation datasets. Implement nested cross-validation, where the outer cross-validation loop is used to estimate predictive performance and the inner cross-validation loop is used to tune mixture and penalty.
Clearly state:
- the cross-validation approach taken for the outer loop
- the cross-validation approach taken for the inner loop
- a brief justification for the chosen cross-validation configuration
Report the selected mixture and penalty and the RMSE for each outer fold.
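A minimal nested cross-validation sketch with scikit-learn (again with hypothetical `X` and `y` for the combined data, and illustrative fold counts and grids): the inner GridSearchCV tunes penalty and mixture, and the outer loop estimates performance.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

grid = {"alpha": np.logspace(-3, 1, 20), "l1_ratio": [0.1, 0.5, 1.0]}
inner = GridSearchCV(ElasticNet(max_iter=10_000), grid,
                     scoring="neg_root_mean_squared_error",
                     cv=KFold(n_splits=5, shuffle=True, random_state=1))
outer = KFold(n_splits=10, shuffle=True, random_state=2)
res = cross_validate(inner, X, y, cv=outer,
                     scoring="neg_root_mean_squared_error",
                     return_estimator=True)
outer_rmse = -res["test_score"]                           # hold-out RMSE for each outer fold
chosen = [e.best_params_ for e in res["estimator"]]       # parameters selected in each outer fold
print(outer_rmse, chosen)
```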
b. Estimating RMSE and uncertainty from nested cross-validation
Using the outer-loop cross-validation results from part (a), estimate the RMSE and construct a 90% confidence interval without using the test data. Clearly describe how the RMSE estimate and confidence interval are obtained from the outer-fold results.
c. Final model fit
Fit a final model on the combined training and validation dataset using tuning parameter values chosen based on the nested cross-validation results.
Clearly describe the approach used to select the final tuning parameters.
d. Test set evaluation
Using the fitted model, predict outcomes for the entire test set and compute the test RMSE.
e. Test set confidence interval
Construct a 90% confidence interval for the RMSE using any appropriate method (e.g., normal theory or bootstrap). Provide details on what method was used.
Problem 4: Comparison
a. Comparison
Compare the results across all modeling and evaluation approaches considered in this homework. Clearly report, for each approach:
- the selected tuning parameters
- the estimated RMSE
- the associated confidence interval
b. Reflection
In a short paragraph, describe what you learned from this homework. Discuss how the choice of tuning and evaluation strategy affects estimated predictive performance and uncertainty, and state which approach you would recommend in practice. Justify your recommendation.