Homework #4: Trees and Random Forest
DS 6030 | Fall 2024 | University of Virginia

This is an independent assignment. Do not discuss or work with classmates.

Required R packages and Directories

data_dir = 'https://mdporter.github.io/teaching/data/' # data directory
library(tidyverse)  # functions for data manipulation
library(ranger)     # fast random forest implementation
library(modeldata)  # for the ames housing data
Problem 1: Tree splitting metrics for classification
Consider the Gini index, classification error, and entropy impurity measures in a simple classification setting with two classes.
Create a single plot that displays each of these quantities as a function of \(p_m\), the estimated probability of an observation in node \(m\) being from class 1. The x-axis should display \(p_m\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
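A minimal sketch of one way to build this plot with tidyverse, assuming entropy in log base 2 (some texts use the natural log or scale entropy by 1/2; adjust to match your course conventions):

```r
# two-class impurity measures as a function of p_m
p = seq(0.001, 0.999, length.out = 500)

impurity = tibble(
  p = p,
  `Gini index` = 2 * p * (1 - p),                   # sum_k p_k (1 - p_k)
  `Classification error` = 1 - pmax(p, 1 - p),      # 1 - max_k p_k
  Entropy = -(p * log2(p) + (1 - p) * log2(1 - p))  # -sum_k p_k log2(p_k)
)

impurity %>%
  pivot_longer(-p, names_to = "measure", values_to = "value") %>%
  ggplot(aes(x = p, y = value, color = measure)) +
  geom_line() +
  labs(x = expression(p[m]), y = "impurity")
```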
Problem 2: Combining bootstrap estimates
Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce the following 10 estimates of \(\Pr(\text{Class is Red} \mid X=x)\): \(\{0.2, 0.25, 0.3, 0.4, 0.4, 0.45, 0.7, 0.85, 0.9, 0.9\}\).
a. Majority Vote
ISLR 8.2 describes the majority vote approach for making a hard classification from a set of bagged classifiers. What is the final classification for this example using majority voting?
b. Average Probability
An alternative is to base the final classification on the average probability. What is the final classification for this example using average probability?
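A quick way to check both rules in R, assuming the usual 0.5 threshold for calling a prediction Red:

```r
probs = c(0.2, 0.25, 0.3, 0.4, 0.4, 0.45, 0.7, 0.85, 0.9, 0.9)

# (a) majority vote: classify each bootstrap estimate at 0.5, then take the most common class
table(ifelse(probs > 0.5, "Red", "Green"))

# (b) average probability: average the 10 estimates, then apply the 0.5 threshold
mean(probs)
```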
Problem 3: Random Forest Tuning
Random forest has several tuning parameters that you will explore in this problem. We will use the ames housing data from the modeldata R package.

There are several R packages for Random Forest. The ranger::ranger() function is much faster than randomForest::randomForest(), so we will use this one.
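One way to load the data (the ames object ships with the modeldata package, so no download is needed):

```r
data("ames", package = "modeldata")  # Ames housing data
dim(ames)                            # rows and columns
```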
a. Random forest (ranger) tuning parameters

List all of the random forest tuning parameters in the ranger::ranger() function. You don’t need to list the parameters related to computation, special models (e.g., survival, maxstat), or anything that won’t impact the predictive performance.

Indicate which tuning parameters you think will be most important to optimize.
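If helpful, you can pull the full argument list straight from the function and then prune it down to the parameters that affect predictions:

```r
# list all arguments of ranger::ranger(); see ?ranger for their descriptions
names(formals(ranger::ranger))
```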
b. Implement Random Forest

Use a random forest model to predict the sales price, Sale_Price. Use the default parameters and report the 10-fold cross-validation RMSE (square root of mean squared error).
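A minimal sketch of one way to get the 10-fold CV RMSE with ranger, assuming random fold assignment and the formula interface (any equivalent cross-validation framework is fine):

```r
set.seed(2024)  # assumed seed; any fixed value works

n_folds = 10
fold = sample(rep(1:n_folds, length.out = nrow(ames)))  # random fold labels

cv_rmse = map_dbl(1:n_folds, function(k) {
  fit = ranger(Sale_Price ~ ., data = ames[fold != k, ])   # default parameters
  pred = predict(fit, data = ames[fold == k, ])$predictions
  sqrt(mean((ames$Sale_Price[fold == k] - pred)^2))        # held-out fold RMSE
})

mean(cv_rmse)  # 10-fold CV RMSE
```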
c. Random Forest Tuning
Now we will vary the tuning parameters mtry and min.bucket to see what effect they have on performance.
- Use a range of reasonable mtry and min.bucket values.
  - The valid mtry values are \(\{1, 2, \ldots, p\}\) where \(p\) is the number of predictor variables. However, the default value of mtry = sqrt(p) = 8 is usually close to optimal, so you may want to focus your search around that value.
  - The default min.bucket = 1 will grow deep trees. This will usually work best if there are enough trees. But try some larger values and see how they impact predictive performance.
  - Set num.trees = 1000, which is larger than the default of 500.
- Use 5-times-repeated out-of-bag (OOB) error to assess performance. That is, run random forest 5 times for each tuning set, calculate the OOB MSE each time, and use the average as the MSE associated with those tuning parameters (a sketch of one possible loop appears after this list).
- Use a single plot to show the average MSE as a function of mtry and min.bucket.
- Report the best tuning parameter combination.
- Note: random forest is a stochastic model; it will be different every time it runs due to the bootstrap sampling and random selection of features to consider for splitting. Set the random seed to control the uncertainty associated with the stochasticity.
- Hint: If you use the ranger package, the prediction.error element in the output is the OOB MSE.
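A minimal sketch of one possible tuning loop, assuming the grid values below are placeholders you would center near sqrt(p) = 8 and that the 5 OOB repeats are averaged per combination:

```r
set.seed(2024)  # assumed seed to control the stochasticity

tune_grid = expand_grid(
  mtry = c(4, 6, 8, 10, 14, 20),    # assumed grid centered near sqrt(p) = 8
  min.bucket = c(1, 3, 5, 10, 25),  # assumed grid of minimum terminal node sizes
  rep = 1:5                         # 5 repeated OOB runs per combination
)

oob_results = tune_grid %>%
  mutate(mse = map2_dbl(mtry, min.bucket, function(m, b) {
    fit = ranger(Sale_Price ~ ., data = ames,
                 mtry = m, min.bucket = b, num.trees = 1000)
    fit$prediction.error            # OOB MSE for this run
  }))

oob_summary = oob_results %>%
  group_by(mtry, min.bucket) %>%
  summarize(mse = mean(mse), .groups = "drop")  # average over the 5 repeats

# single plot of average OOB MSE as a function of mtry and min.bucket
ggplot(oob_summary, aes(x = mtry, y = mse, color = factor(min.bucket))) +
  geom_line() + geom_point() +
  labs(color = "min.bucket", y = "average OOB MSE")

oob_summary %>% slice_min(mse, n = 1)  # best tuning parameter combination
```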