DS 6030 | Spring 2026 | University of Virginia
Homework #2: Model Selection and Performance Estimation
Set-up
In this homework you will study model tuning, predictive performance, and uncertainty quantification using elastic net regression. All models should be fit with elastic net regularization, tuning over the regularization/penalty parameter and, where feasible, the mixing parameter.
Data
The goal is to predict the year a song was released based on audio features. Each dataset contains an outcome variable (Y, the song release year) and a set of \(p=45\) numeric predictors (X1 to X45) derived from audio features.
Modeling requirements
All predictive models in this homework must be elastic net regression models. You may use any software package you prefer that implements elastic net regression, such as glmnet or tidymodels in R, or scikit-learn in Python.
Unless otherwise stated, performance should be evaluated using root mean squared error (RMSE).
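For predictions \(\hat{y}_i\) of outcomes \(y_i\) on \(n\) held-out observations,
\[
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.
\]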
Elastic net regression
Elastic net regression is a linear regression model with regularization that combines the ideas of ridge regression and the lasso. The model is fit by minimizing a loss function that balances prediction accuracy with a penalty on the size of the regression coefficients.
Specifically, elastic net minimizes the prediction error plus a weighted combination of an \(\ell_1\) penalty (lasso) and an \(\ell_2\) penalty (ridge).
Two tuning parameters control this balance. The mixture parameter controls the tradeoff between \(\ell_1\) (lasso) and \(\ell_2\) (ridge) penalties. The penalty parameter controls the strength of the overall penalty.
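For concreteness, one common parameterization of the elastic net objective (the form used, up to minor scaling conventions, by glmnet and scikit-learn) is
\[
\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda\left(\alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2\right),
\]
where \(\lambda \ge 0\) is the penalty (overall strength) and \(\alpha \in [0, 1]\) is the mixture (\(\alpha = 1\) gives the lasso, \(\alpha = 0\) gives ridge regression).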
Computational guidance and fallback options
Some parts of this homework involve repeated or nested cross validation, which can be computationally demanding depending on your hardware. If you encounter computing or memory limitations, you may use one or more of the following simplifications. Clearly state any simplifications you use in your solution.
- Reduce the size of the training and validation data by random subsampling, while keeping the test set fixed.
- Reduce the cross validation complexity by using fewer folds or fewer repetitions.
- Fix mixture to 1 (lasso) and tune only penalty.
You should not modify the test dataset or reduce its size unless explicitly instructed.
Problem 1: Tuning using a single validation set
a. Tuning
Fit models on the training data and use the validation set to tune mixture and penalty to minimize RMSE. Clearly report the selected values of mixture and penalty, along with the corresponding validation RMSE.
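As a rough guide, a minimal sketch of this tuning loop in Python with scikit-learn is shown below. Here `X_train`, `y_train`, `X_val`, and `y_val` are hypothetical names for your loaded training and validation arrays, and the grids are illustrative only; scikit-learn calls the penalty `alpha` and the mixture `l1_ratio`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

best = {"rmse": np.inf}
for l1_ratio in [0.1, 0.5, 1.0]:                 # mixture grid (illustrative)
    for alpha in np.logspace(-3, 1, 20):         # penalty grid (illustrative)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
        if rmse < best["rmse"]:
            best = {"mixture": l1_ratio, "penalty": alpha, "rmse": rmse}

print(best)   # selected mixture/penalty and the corresponding validation RMSE
```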
b. Fit final model
Refit the model using the combined training and validation data, fixing mixture and penalty at the selected values from part (a).
c. Predict on a subset of the test data
Using the final fitted model, predict outcomes for the first 1000 observations in the test set. Compute and report the test RMSE for this subset.
d. 90% confidence interval via normal theory
Construct a 90% confidence interval for the RMSE in part (c) using a normal theory approximation.
Report the CI.
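One reasonable normal-theory construction (a sketch, not the only valid choice) builds the interval on the mean squared error and then takes square roots of the endpoints. Here `y_test_sub` and `y_pred_sub` are hypothetical arrays holding the outcomes and predictions from part (c).

```python
import numpy as np
from scipy import stats

sq_err = (y_test_sub - y_pred_sub) ** 2          # squared errors for the 1000 test rows
n = len(sq_err)
mse = sq_err.mean()
se = sq_err.std(ddof=1) / np.sqrt(n)             # standard error of the mean squared error
z = stats.norm.ppf(0.95)                         # two-sided 90% interval
lo, hi = max(mse - z * se, 0.0), mse + z * se    # guard against a negative lower endpoint
print(f"RMSE = {np.sqrt(mse):.3f}, 90% CI = ({np.sqrt(lo):.3f}, {np.sqrt(hi):.3f})")
```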
e. 90% confidence interval via bootstrap
Construct a 90% confidence interval for the RMSE in part (c) using a bootstrap procedure (e.g., https://openintro-ims.netlify.app/foundations-bootstrapping). Specify:
- the bootstrap type (e.g., percentile, bias-corrected, normal, studentized)
- the number of bootstrap resamples used
Report the CI.
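A minimal percentile-bootstrap sketch, resampling test observations only (the fitted model is not refit); `y_test_sub` and `y_pred_sub` are the same hypothetical arrays as above, and B = 2000 resamples is only a suggestion.

```python
import numpy as np

rng = np.random.default_rng(2026)
errors = y_test_sub - y_pred_sub
n, B = len(errors), 2000
boot_rmse = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)             # resample test rows with replacement
    boot_rmse[b] = np.sqrt(np.mean(errors[idx] ** 2))
ci = np.percentile(boot_rmse, [5, 95])           # 90% percentile interval
print(f"90% bootstrap CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```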
f. Full test set evaluation
Predict outcomes for the entire test set and compute the RMSE.
g. 90% confidence intervals for the full test set
Using the predictions from part (f), construct 90% confidence intervals for the RMSE using:
- normal theory
- the bootstrap method (use the same bootstrap approach as in part (e))
Report both CIs.
h. Visualization
Create a single graphic that communicates uncertainty in predictive performance. The graphic must show:
- the RMSE point estimates
- the 90% confidence intervals
for both test set sizes (partial test set from part (c) and full test set from part (f)) and for both confidence interval methods (normal theory and bootstrap).
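One possible layout, as a sketch: a point-and-error-bar plot with matplotlib. Here `results` is a hypothetical list of (label, RMSE, lower, upper) tuples that you would fill in from parts (c)-(g).

```python
import matplotlib.pyplot as plt
import numpy as np

# results = [("partial / normal", ..., ..., ...), ("partial / bootstrap", ..., ..., ...),
#            ("full / normal", ..., ..., ...),    ("full / bootstrap", ..., ..., ...)]
labels = [r[0] for r in results]
est = np.array([r[1] for r in results])
lo = np.array([r[2] for r in results])
hi = np.array([r[3] for r in results])

x = np.arange(len(results))
plt.errorbar(x, est, yerr=[est - lo, hi - est], fmt="o", capsize=4)
plt.xticks(x, labels, rotation=20)
plt.ylabel("Test RMSE")
plt.title("RMSE estimates with 90% confidence intervals")
plt.tight_layout()
plt.show()
```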
Problem 2: Tuning using repeated cross-validation
a. Tuning with repeated cross-validation
Combine the training and validation datasets. Using this combined dataset only, tune mixture and penalty via repeated Monte Carlo cross-validation to minimize RMSE. Clearly state:
- the number of hold-out (test) observations
- the number of repetitions
Report the selected mixture and penalty, along with the mean cross-validated RMSE.
Provide a brief justification for the chosen cross-validation configuration (e.g., bias–variance considerations, computational cost, dataset size).
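A minimal sketch of one way to set this up with scikit-learn, where `X` and `y` are hypothetical names for the combined training + validation data and `ShuffleSplit` provides the repeated random (Monte Carlo) hold-out splits; the hold-out fraction, repetition count, and grids are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, ShuffleSplit

cv = ShuffleSplit(n_splits=20, test_size=0.20, random_state=2026)   # 20 repetitions, 20% held out
grid = {"alpha": np.logspace(-3, 1, 20),     # penalty
        "l1_ratio": [0.1, 0.5, 1.0]}         # mixture
search = GridSearchCV(ElasticNet(max_iter=10_000), grid,
                      scoring="neg_root_mean_squared_error", cv=cv)
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # selected parameters and mean CV RMSE
```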
b. Estimating RMSE and uncertainty from cross-validation
Using the repeated cross-validation results from part (a), estimate the RMSE and construct a 90% confidence interval without using the test data. Clearly describe how the RMSE estimate and confidence interval are obtained from the cross-validation folds. Any reasonable approach to constructing the confidence interval is acceptable.
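One simple option, as a sketch: treat the per-repetition hold-out RMSEs as an approximate sample and form a normal-theory interval around their mean. Here `fold_rmse` is a hypothetical array of the RMSE from each repetition for the selected parameters; note that the repetitions share observations, so this interval tends to understate the true uncertainty.

```python
import numpy as np
from scipy import stats

r = len(fold_rmse)                               # number of repetitions
est = fold_rmse.mean()
se = fold_rmse.std(ddof=1) / np.sqrt(r)
z = stats.norm.ppf(0.95)                         # two-sided 90% interval
print(f"CV RMSE = {est:.3f}, 90% CI = ({est - z * se:.3f}, {est + z * se:.3f})")
```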
c. Final model fit
Fit the final model on the combined training and validation dataset using the selected mixture and penalty.
d. Test set evaluation
Using the fitted model, predict outcomes for the entire test set and compute the test RMSE.
e. Test set confidence interval
Construct a 90% confidence interval for the RMSE using any appropriate method (e.g., normal theory or bootstrap). Provide details on what method was used.
Report the CI.
f. Visualization
Create a single graphic that compares uncertainty estimates before and after observing the test data. Show:
- the RMSE estimate and 90% confidence interval from cross-validation
- the RMSE estimate and 90% confidence interval from the test set
g. Reflection
In a short paragraph, describe what you learned from this exercise. Discuss how the cross-validation based RMSE and confidence interval compared to the test set results, what surprised you (if anything), and which approach you would trust most for reporting predictive performance in practice. Briefly explain why.
Problem 3: Nested Cross-Validation
a. Nested cross-validation implementation
Combine the training and validation datasets. Implement nested cross-validation, where the outer cross-validation loop is used to estimate predictive performance and the inner cross-validation loop is used to tune mixture and penalty.
Clearly state:
- the cross-validation approach taken for the outer loop
- the cross-validation approach taken for the inner loop
- a brief justification for the chosen cross-validation configuration
Report the selected mixture and penalty and the RMSE for each outer fold.
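A minimal nested cross-validation sketch with scikit-learn (again with hypothetical `X` and `y` for the combined data, and illustrative fold counts and grids): the inner GridSearchCV tunes penalty and mixture, and the outer loop estimates performance.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

grid = {"alpha": np.logspace(-3, 1, 20), "l1_ratio": [0.1, 0.5, 1.0]}
inner = GridSearchCV(ElasticNet(max_iter=10_000), grid,
                     scoring="neg_root_mean_squared_error",
                     cv=KFold(n_splits=5, shuffle=True, random_state=1))
outer = KFold(n_splits=10, shuffle=True, random_state=2)
res = cross_validate(inner, X, y, cv=outer,
                     scoring="neg_root_mean_squared_error",
                     return_estimator=True)
outer_rmse = -res["test_score"]                           # hold-out RMSE for each outer fold
chosen = [e.best_params_ for e in res["estimator"]]       # parameters selected in each outer fold
print(outer_rmse, chosen)
```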
b. Estimating RMSE and uncertainty from nested cross-validation
Using the outer-loop cross-validation results from part (a), estimate the RMSE and construct a 90% confidence interval without using the test data. Clearly describe how the RMSE estimate and confidence interval are obtained from the outer-fold results.
c. Final model fit
Fit a final model on the combined training and validation dataset using tuning parameter values chosen based on the nested cross-validation results.
Clearly describe the approach used to select the final tuning parameters.
d. Test set evaluation
Using the fitted model, predict outcomes for the entire test set and compute the test RMSE.
e. Test set confidence interval
Construct a 90% confidence interval for the RMSE using any appropriate method (e.g., normal theory or bootstrap). Provide details on what method was used.
Problem 4: Comparison
a. Comparison
Compare the results across all modeling and evaluation approaches considered in this homework. Clearly report, for each approach:
- the selected tuning parameters
- the estimated RMSE
- the associated confidence interval
b. Reflection
In a short paragraph, describe what you learned from this homework. Discuss how the choice of tuning and evaluation strategy affects estimated predictive performance and uncertainty, and state which approach you would recommend in practice. Justify your recommendation.