DS 6030 | Fall 2024 | University of Virginia
Homework #2: Resampling

Required R packages and Directories

data_dir = 'https://mdporter.github.io/teaching/data/' # data directory
library(tidyverse)  # functions for data manipulation
library(tidymodels) # for optional tidymodels solutions
Problem 1: Bootstrapping
Bootstrap resampling can be used to quantify the uncertainty in a fitted curve.
a. Data Generating Process
Create a set of functions to generate data from the following distributions: \[\begin{align*} X &\sim \mathcal{U}(0, 2) \qquad \text{Uniform between $0$ and $2$}\\ Y &= 1 + 2x + 5\sin(5x) + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\, \sigma=2.5) \end{align*}\]
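A minimal sketch of such generating functions in R (the helper names `sim_x`, `f`, and `sim_y` are my own, not prescribed by the assignment):

```r
sim_x <- function(n) runif(n, min = 0, max = 2)   # X ~ Uniform(0, 2)

f <- function(x) 1 + 2*x + 5*sin(5*x)             # true mean function E[Y | X = x]

sim_y <- function(x, sd = 2.5) {                  # Y = f(X) + eps, eps ~ N(0, sd^2)
  f(x) + rnorm(length(x), mean = 0, sd = sd)
}
```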
b. Simulate data
Simulate \(n=100\) realizations from these distributions. Produce a scatterplot and draw the true regression line \(f(x) = E[Y \mid X=x]\). Use set.seed(211)
prior to generating the data.
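A sketch of the simulation and plot, assuming the `sim_x`, `sim_y`, and `f` helpers suggested in part a:

```r
library(tidyverse)

set.seed(211)                                # set seed before generating the data
n <- 100
data_train <- tibble(x = sim_x(n),
                     y = sim_y(x))

ggplot(data_train, aes(x, y)) +
  geom_point() +
  geom_function(fun = f, color = "blue") +   # true regression line f(x) = E[Y | X = x]
  labs(title = "n = 100 simulated realizations with true regression function")
```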
c. 5th degree polynomial fit
Fit a 5th degree polynomial. Produce a scatterplot and draw the estimated regression curve.
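One way to fit and plot the polynomial, assuming the `data_train` tibble from part b:

```r
# poly() uses an orthogonal basis, avoiding the collinearity of raw powers
fit5 <- lm(y ~ poly(x, degree = 5), data = data_train)

ggplot(data_train, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE)  # estimated curve
```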
d. Bootstrap sampling
Make 200 bootstrap samples. For each bootstrap sample, fit a 5th degree polynomial and make predictions at eval_pts = seq(0, 2, length=100).
- Set the seed (use set.seed(212)) so your results are reproducible.
- Produce a scatterplot with the original data and add the 200 bootstrap curves.
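A sketch of the bootstrap loop, again assuming the `data_train` tibble from part b (tidyverse loaded):

```r
eval_pts <- seq(0, 2, length = 100)

set.seed(212)                                 # reproducible bootstrap samples
M <- 200
boot_preds <- map_dfr(1:M, function(m) {
  ind <- sample(nrow(data_train), replace = TRUE)            # resample rows
  fit <- lm(y ~ poly(x, degree = 5), data = data_train[ind, ])
  tibble(iter  = m,
         x     = eval_pts,
         y_hat = predict(fit, tibble(x = eval_pts)))
})

ggplot(data_train, aes(x, y)) +
  geom_line(data = boot_preds, aes(x, y_hat, group = iter),
            color = "red", alpha = 0.1) +                    # 200 bootstrap curves
  geom_point()
```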
e. Confidence Intervals
Calculate the pointwise 95% confidence intervals from the bootstrap samples. That is, for each \(x \in {\rm eval\_pts}\), calculate the upper and lower limits such that only 5% of the curves fall outside the interval at \(x\).
- Remake the plot from part c, but add the upper and lower boundaries from the 95% confidence intervals.
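The pointwise limits are just quantiles of the bootstrap predictions at each evaluation point; a sketch assuming the `boot_preds` tibble built in part d:

```r
ci <- boot_preds %>%
  group_by(x) %>%
  summarize(lower = quantile(y_hat, 0.025),   # 2.5% of curves below
            upper = quantile(y_hat, 0.975))   # 2.5% of curves above

ggplot(data_train, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE) +
  geom_line(data = ci, aes(x, lower), linetype = "dashed") +
  geom_line(data = ci, aes(x, upper), linetype = "dashed")
```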
Problem 2: V-Fold cross-validation with \(k\) nearest neighbors
Run 10-fold cross-validation on the data generated in part 1b to select the optimal \(k\) in a k-nearest neighbor (kNN) model. Then evaluate how well cross-validation performed by evaluating the performance on a large test set. The steps below will guide you.
a. Implement 10-fold cross-validation
Use \(10\)-fold cross-validation to find the value of \(k\) (i.e., neighborhood size) that provides the smallest cross-validated MSE using a kNN model.
- Search over \(k=3,4,\ldots, 40\).
- Use set.seed(221) prior to generating the folds to ensure the results are replicable.
- Show the following:
- the optimal \(k\) (as determined by cross-validation)
- the corresponding estimated MSE
- produce a plot with \(k\) on the x-axis and the estimated MSE on the y-axis (optional: add 1-standard error bars).
- Notation: \(k\) is the tuning parameter for the kNN model; \(v=10\) is the number of folds in V-fold cross-validation. Don’t confuse the two.
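A sketch of the CV loop using `FNN::knn.reg()` (the FNN package is one choice of kNN implementation, not required here), assuming `data_train` from Problem 1 and tidyverse loaded:

```r
library(FNN)         # for knn.reg(); any kNN regression implementation works

k_values <- 3:40
v <- 10

set.seed(221)                                        # seed before making folds
fold <- sample(rep(1:v, length = nrow(data_train)))  # random fold labels

cv_results <- map_dfr(k_values, function(k) {
  mse_fold <- map_dbl(1:v, function(j) {
    train <- data_train[fold != j, ]
    test  <- data_train[fold == j, ]
    pred  <- knn.reg(train = train["x"], test = test["x"],
                     y = train$y, k = k)$pred
    mean((test$y - pred)^2)                          # held-out MSE for fold j
  })
  tibble(k = k, mse = mean(mse_fold), se = sd(mse_fold) / sqrt(v))
})

cv_results %>% slice_min(mse)                        # optimal k and estimated MSE

ggplot(cv_results, aes(k, mse)) +
  geom_point() + geom_line() +
  geom_errorbar(aes(ymin = mse - se, ymax = mse + se))  # optional 1-SE bars
```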
b. Find the optimal edf
The \(k\) (number of neighbors) in a kNN model determines the effective degrees of freedom edf. What is the optimal edf? Be sure to use the correct sample size when making this calculation. Produce a plot similar to that from part a, but use edf (effective degrees of freedom) on the x-axis.
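Using the common approximation edf \(\approx n/k\) for kNN (as in ISL), the relevant sample size is the size of each CV *training* set, i.e. \(9/10\) of the 100 observations. A sketch assuming the `cv_results` tibble from part a:

```r
n_fold <- nrow(data_train) * (10 - 1) / 10   # 90 obs in each CV training set

cv_results %>%
  mutate(edf = n_fold / k) %>%               # edf = n/k for kNN
  ggplot(aes(edf, mse)) +
  geom_point() + geom_line() +
  labs(x = "effective degrees of freedom (n/k)", y = "estimated MSE")
```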
c. Choose \(k\)
After running cross-validation, a final model fit from all of the training data needs to be produced to make predictions. What value of \(k\) would you choose? Why?
d. Evaluate actual performance
Now we will see how well cross-validation performed. Simulate a test data set of \(50000\) observations from the same distributions. Use set.seed(223)
prior to generating the test data.
- Fit a set of kNN models, using the full training data, and calculate the mean squared error (MSE) on the test data for each model. Use the same \(k\) values as in part a.
- Report the optimal \(k\), the corresponding edf, and MSE based on the test set.
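A sketch of the test-set evaluation, assuming the `sim_x`/`sim_y` helpers from Problem 1, `data_train`, and the `k_values` vector from part a:

```r
set.seed(223)                                 # seed before generating test data
n_test <- 50000
data_test <- tibble(x = sim_x(n_test),
                    y = sim_y(x))

test_results <- map_dfr(k_values, function(k) {
  pred <- knn.reg(train = data_train["x"], test = data_test["x"],
                  y = data_train$y, k = k)$pred
  tibble(k   = k,
         edf = nrow(data_train) / k,          # full training set: n = 100
         mse = mean((data_test$y - pred)^2))
})

test_results %>% slice_min(mse)               # optimal k, edf, and test MSE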
e. Performance plots
Plot both the cross-validation estimated error and the (true) error calculated from the test data on the same plot. See Figure 5.6 in ISL (pg 182) as a guide.
- Produce two plots: one with \(k\) on the x-axis and one with edf on the x-axis.
- Each plot should have two lines: one from part a and one from part d
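The two error curves can be stacked and plotted together, assuming `cv_results` (with an `edf` column as in part b) and `test_results` from part d:

```r
results <- bind_rows(
  cv_results   %>% select(k, edf, mse) %>% mutate(error = "10-fold CV estimate"),
  test_results %>% select(k, edf, mse) %>% mutate(error = "test (true) error")
)

ggplot(results, aes(k, mse, color = error)) +    # k on the x-axis
  geom_line() + geom_point()

ggplot(results, aes(edf, mse, color = error)) +  # edf on the x-axis
  geom_line() + geom_point()
```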
f. Did cross-validation work as intended?
Based on the plots from part e, does it appear that cross-validation worked as intended? How sensitive is the resulting test MSE to the choice of \(k\)?