SYS 6018 | Spring 2024 | University of Virginia
Homework #1: Supervised Learning
Required R packages and Directories
library(tidyverse) # functions for data manipulation
Problem 1: Evaluating a Regression Model
a. Data generating functions
Create a set of functions to generate data from the following distributions:
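\[\begin{align*} X &\sim \mathcal{N}(0, 1) \\ Y &= -1 + .5X + .2X^2 + \epsilon \\ \epsilon &\sim \mathcal{N}(0,\, \sigma) \end{align*}\]
Here \(\sigma\) denotes the standard deviation of the random error \(\epsilon\), so \(\mathrm{Var}(\epsilon) = \sigma^2\).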
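One possible shape for these functions is sketched below. The names `sim_x`, `sim_y`, and `f` are hypothetical (the assignment does not prescribe them), and the sketch assumes \(\sigma\) is passed as a standard deviation, matching R's `rnorm()`.

```r
# Hypothetical helper names; the assignment does not prescribe them.

# Draw n predictor values from N(0, 1)
sim_x <- function(n) rnorm(n, mean = 0, sd = 1)

# True regression function f(x) = E[Y | X = x]
f <- function(x) -1 + 0.5 * x + 0.2 * x^2

# Draw one response per element of x; sigma is the standard deviation of the error
sim_y <- function(x, sigma) {
  f(x) + rnorm(length(x), mean = 0, sd = sigma)
}
```

Keeping `f()` separate from the noise makes it easy to reuse the true regression function later for plotting and for the optimal-MSE calculation.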
b. Generate training data
Simulate \(n=100\) realizations from these distributions using \(\sigma=3\). Produce a scatterplot and draw the true regression line \(f(x) = E[Y \mid X=x]\).
- Use set.seed(611) prior to generating the data.
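A minimal sketch of this step follows. It uses base R graphics rather than tidyverse so that it runs without extra packages. The helper names and the choice to draw `x` before `y` are assumptions, so the exact simulated values depend on that ordering.

```r
# Hypothetical helpers (names are illustrative, not prescribed by the assignment)
f     <- function(x) -1 + 0.5 * x + 0.2 * x^2
sim_x <- function(n) rnorm(n)
sim_y <- function(x, sigma) f(x) + rnorm(length(x), sd = sigma)

set.seed(611)
n     <- 100
sigma <- 3
x <- sim_x(n)                 # predictors are drawn first ...
y <- sim_y(x, sigma = sigma)  # ... then the responses
data_train <- data.frame(x = x, y = y)

# Scatterplot of the training data with the true regression line f(x)
plot(y ~ x, data = data_train, pch = 16, col = "grey40")
curve(f(x), add = TRUE, col = "black", lwd = 2)
```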
c. Fit three models
Fit three polynomial regression models using least squares: linear, quadratic, and cubic. Produce another scatterplot, add the fitted lines and true population line \(f(x)\) using different colors, and add a legend that maps the line color to a model.
- Note: The true model is quadratic, but we are also fitting linear (less complex) and cubic (more complex) models.
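The fitting and plotting could be sketched as follows, using `lm()` with `poly()` and base graphics. The object names and colors are arbitrary choices. The training data are regenerated here only to keep the block self-contained.

```r
f <- function(x) -1 + 0.5 * x + 0.2 * x^2

# Regenerate the training data (assumes x is drawn before y)
set.seed(611)
x <- rnorm(100)
data_train <- data.frame(x = x, y = f(x) + rnorm(100, sd = 3))

# Least-squares polynomial fits of degree 1, 2, and 3
fit_linear <- lm(y ~ poly(x, 1), data = data_train)
fit_quad   <- lm(y ~ poly(x, 2), data = data_train)
fit_cubic  <- lm(y ~ poly(x, 3), data = data_train)

# Evaluate every curve on a common grid over the observed x range
grid <- data.frame(x = seq(min(data_train$x), max(data_train$x), length.out = 200))

plot(y ~ x, data = data_train, pch = 16, col = "grey60")
lines(grid$x, f(grid$x), col = "black", lwd = 2)
lines(grid$x, predict(fit_linear, newdata = grid), col = "red", lwd = 2)
lines(grid$x, predict(fit_quad, newdata = grid), col = "blue", lwd = 2)
lines(grid$x, predict(fit_cubic, newdata = grid), col = "darkgreen", lwd = 2)
legend("topleft",
       legend = c("true f(x)", "linear", "quadratic", "cubic"),
       col = c("black", "red", "blue", "darkgreen"), lwd = 2)
```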
d. Predictive performance
Generate a test data set of 10,000 observations from the same distributions. Use set.seed(612) prior to generating the test data.
- Calculate the estimated mean squared error (MSE) for each model.
- Are the results as expected?
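The test MSE is the average squared difference between the test responses and each model's predictions. A sketch under the same assumptions as before (hypothetical helper names, `x` drawn before `y`):

```r
f <- function(x) -1 + 0.5 * x + 0.2 * x^2

# Hypothetical helper: simulate n (x, y) pairs with error standard deviation sigma
sim_data <- function(n, sigma) {
  x <- rnorm(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
}

set.seed(611)
data_train <- sim_data(100, sigma = 3)
set.seed(612)
data_test <- sim_data(10000, sigma = 3)

fits <- list(
  linear    = lm(y ~ poly(x, 1), data = data_train),
  quadratic = lm(y ~ poly(x, 2), data = data_train),
  cubic     = lm(y ~ poly(x, 3), data = data_train)
)

# Estimated MSE of each model on the held-out test data
mse <- sapply(fits, function(fit) {
  mean((data_test$y - predict(fit, newdata = data_test))^2)
})
print(mse)
```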
e. Optimal performance
What is the best achievable MSE? That is, what is the MSE if the true \(f(x)\) was used to evaluate the test set? How close does the best method come to achieving the optimum?
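As a worked identity for the optimum, assume \(\sigma\) is the standard deviation of \(\epsilon\) (as part h states) and that \(\epsilon\) has mean zero. Predicting with the true \(f\) leaves only the noise, so the expected squared error is the noise variance:

\[\begin{align*} E\left[(Y - f(X))^2\right] &= E\left[\epsilon^2\right] \\ &= \mathrm{Var}(\epsilon) + \left(E[\epsilon]\right)^2 \\ &= \sigma^2 \end{align*}\]

With \(\sigma = 3\) this gives 9. The value computed on a finite test set will differ slightly from it because of sampling variation.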
f. Replication
The MSE scores obtained in part d came from one realization of training data. Here we will explore how much the MSE scores vary by replicating the simulation many times.
- Re-run parts b. and c. (i.e., generate training data and fit models) 100 times.
- Do not generate new testing data
- Use set.seed(613) prior to running the simulation and do not set the seed in any other places.
- Calculate the test MSE for all simulations.
- Use the same test data from part d. (This question is only about the variability that comes from the training data).
- Create kernel density or histogram plots of the resulting MSE values for each model.
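The replication steps above can be sketched as below. The helper names `sim_data`, `fit_models`, and `test_mse` are hypothetical. The test data are generated once with seed 612 before the simulation seed is set, and base graphics are used for the density plot.

```r
f <- function(x) -1 + 0.5 * x + 0.2 * x^2

sim_data <- function(n, sigma) {
  x <- rnorm(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
}

fit_models <- function(data_train) {
  list(
    linear    = lm(y ~ poly(x, 1), data = data_train),
    quadratic = lm(y ~ poly(x, 2), data = data_train),
    cubic     = lm(y ~ poly(x, 3), data = data_train)
  )
}

test_mse <- function(fits, data_test) {
  sapply(fits, function(fit) mean((data_test$y - predict(fit, newdata = data_test))^2))
}

# Fixed test data from part d
set.seed(612)
data_test <- sim_data(10000, sigma = 3)

# 100 fresh training sets; one row of test MSEs per simulation
set.seed(613)
mse_mat <- t(replicate(100, test_mse(fit_models(sim_data(100, sigma = 3)), data_test)))

# Overlaid kernel density estimates, one curve per model
dens <- lapply(as.data.frame(mse_mat), density)
cols <- c(linear = "red", quadratic = "blue", cubic = "darkgreen")
plot(range(sapply(dens, function(d) range(d$x))),
     c(0, max(sapply(dens, function(d) max(d$y)))),
     type = "n", xlab = "test MSE", ylab = "density")
for (m in names(dens)) lines(dens[[m]], col = cols[[m]], lwd = 2)
legend("topright", legend = names(dens), col = cols[names(dens)], lwd = 2)
```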
g. Best model
Show a count of how many times each model was the best. That is, out of the 100 simulations, count how many times each model had the lowest MSE.
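The counting step is shown here on a tiny matrix of made-up MSE values (illustrative numbers only, not simulation results), with one row per simulation and one column per model. In the actual assignment, the matrix would hold the 100 rows of test MSEs from part f.

```r
# Made-up values purely to illustrate the mechanics; NOT results from the simulation
mse_mat <- matrix(
  c(9.5, 9.2, 9.4,
    9.1, 9.3, 9.6,
    9.8, 9.4, 9.3,
    9.7, 9.0, 9.2),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("linear", "quadratic", "cubic"))
)

# For each simulation (row), find the model (column) with the lowest MSE
best <- colnames(mse_mat)[apply(mse_mat, 1, which.min)]

# Tabulate; the factor levels keep models that never win in the table with count 0
best_counts <- table(factor(best, levels = colnames(mse_mat)))
print(best_counts)  # linear 1, quadratic 2, cubic 1 for these made-up values
```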
h. Function to implement simulation
Write a function that implements the simulation in part f. The function should have arguments for i) the size of the training data \(n\), ii) the standard deviation of the random error \(\sigma\), and iii) the test data. Use the same set.seed(613).
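One way such a function might look is sketched below. The names `run_sim` and `count_best` are hypothetical. The extra `n_sim` argument and the choice to call `set.seed(613)` inside the function are assumptions, not requirements stated in the assignment.

```r
f <- function(x) -1 + 0.5 * x + 0.2 * x^2

sim_data <- function(n, sigma) {
  x <- rnorm(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = sigma))
}

fit_models <- function(data_train) {
  list(
    linear    = lm(y ~ poly(x, 1), data = data_train),
    quadratic = lm(y ~ poly(x, 2), data = data_train),
    cubic     = lm(y ~ poly(x, 3), data = data_train)
  )
}

# Returns an n_sim x 3 matrix of test MSEs (rows: simulations, columns: models)
run_sim <- function(n, sigma, data_test, n_sim = 100) {
  set.seed(613)  # assumption: seed is set here so every call is reproducible
  t(replicate(n_sim, {
    fits <- fit_models(sim_data(n, sigma))
    sapply(fits, function(fit) mean((data_test$y - predict(fit, newdata = data_test))^2))
  }))
}

# Number of simulations in which each model had the lowest test MSE
count_best <- function(mse_mat) {
  best <- colnames(mse_mat)[apply(mse_mat, 1, which.min)]
  table(factor(best, levels = colnames(mse_mat)))
}

# Usage with the part d settings
set.seed(612)
data_test <- sim_data(10000, sigma = 3)
mse_mat <- run_sim(n = 100, sigma = 3, data_test = data_test)
best_counts <- count_best(mse_mat)
print(best_counts)
```

For other settings, the same two steps apply: generate new test data with the desired \(\sigma\) under seed 612, then call `run_sim()` with the matching `n` and `sigma`.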
i. Performance when \(\sigma=2\)
Use your function to repeat the simulation in part f, but use \(\sigma=2\). Report the number of times each model was best (you do not need to produce any plots).
- First generate new test data (\(n = 10000\), \(\sigma = 2\), using seed = 612).
j. Performance when \(\sigma=4\) and \(n=300\)
Repeat i, but now use \(\sigma=4\) and \(n=300\).
- First generate new test data (\(n = 10000\), \(\sigma = 4\), using seed = 612).
k. Understanding
Describe the effects that \(\sigma\) and \(n\) have on the selection of the best model. Why is the true model form (i.e., quadratic) not always the best model to use when prediction is the goal?
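A standard identity that may help frame the answer is the bias-variance decomposition of the expected squared prediction error at a point \(x\). The notation \(\hat{f}\) for a model fitted to a random training set is introduced here and is not used elsewhere in the assignment. The expectation is taken over training sets and the new error \(\epsilon\), which is assumed independent of the training data:

\[\begin{align*} E\left[(Y - \hat{f}(x))^2 \mid X = x\right] &= \sigma^2 + \left(E[\hat{f}(x)] - f(x)\right)^2 + \mathrm{Var}\left(\hat{f}(x)\right) \end{align*}\]

The three terms are the irreducible noise, the squared bias, and the variance of the fitted model. The sizes of \(\sigma\) and \(n\) affect the last term, which is one way to reason about which model tends to win in each simulation setting.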