SYS 6018 | Spring 2024 | University of Virginia

Homework #6: Trees and Random Forest

Author

Your Name Here

Published

March 9, 2024

Required R packages and Directories

data.dir = 'https://mdporter.github.io/DS6018/data/' # data directory
library(tidyverse)    # functions for data manipulation  
library(ranger)       # fast random forest implementation
library(modeldata)    # for the ames housing data

Problem 1: Tree Splitting for classification

Consider the Gini index, classification error, and entropy impurity measures in a simple classification setting with two classes.

Create a single plot that displays each of these quantities as a function of \(p_m\), the estimated probability of an observation in node \(m\) being from class 1. The x-axis should display \(p_m\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
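For reference, with two classes (writing \(p = p_m\) for the proportion of class 1 in node \(m\)), the three impurity measures reduce to

\[
G(p) = 2p(1-p), \qquad E(p) = 1 - \max(p,\, 1-p), \qquad D(p) = -p \log p - (1-p)\log(1-p),
\]

where the natural log is used for the entropy (a base-2 log only rescales that curve). A minimal sketch for evaluating these on a grid of \(p_m\) values (the grid resolution is an arbitrary choice, and the endpoints are avoided so the entropy terms are defined):

p = seq(0.001, 0.999, length.out = 500)       # grid of p_m values
gini    = 2 * p * (1 - p)                     # Gini index
error   = 1 - pmax(p, 1 - p)                  # classification error
entropy = -p * log(p) - (1 - p) * log(1 - p)  # entropy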

Solution

Solution here

Problem 2: Combining bootstrap estimates

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce the following 10 estimates of \(\Pr(\text{Class is Red} \mid X=x)\): \(\{0.2, 0.25, 0.3, 0.4, 0.4, 0.45, 0.7, 0.85, 0.9, 0.9\}\).
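As a starting point, these estimates can be stored in a vector (the name p_hat below is just an illustrative choice):

p_hat = c(0.2, 0.25, 0.3, 0.4, 0.4, 0.45, 0.7, 0.85, 0.9, 0.9)  # Pr(Class is Red | X = x)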

a. Majority Vote

ISLR 8.2 describes the majority vote approach for making a hard classification from a set of bagged classifiers. What is the final classification for this example using majority voting?
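A minimal sketch of the vote tally, using the p_hat vector defined above and treating an estimate above 0.5 as a vote for Red:

votes = ifelse(p_hat > 0.5, "Red", "Green")   # each bagged tree's hard classification
table(votes)                                  # majority vote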

Solution

Solution here

b. Average Probability

An alternative is to base the final classification on the average probability. What is the final classification for this example using average probability?
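A corresponding sketch for the average-probability rule, again using p_hat from above:

mean(p_hat)                                   # average of the 10 estimates
ifelse(mean(p_hat) > 0.5, "Red", "Green")     # classification from the average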

Solution

Solution here

c. Unequal misclassification costs

Suppose mis-classifying a Red observation (as Green) is twice as costly as mis-classifying a Green observation (as Red). How would you modify both approaches to make better final classifications under these unequal costs? Report the final classifications.
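One common way to account for a 2:1 cost ratio is to lower the classification threshold: predicting Green has expected cost \(2p\) while predicting Red has expected cost \(1-p\), so Red is the cheaper prediction whenever \(p > 1/3\). A sketch under that assumption, reusing p_hat from above:

threshold = 1/3                                     # cost-adjusted threshold: 2*p > 1 - p

table(ifelse(p_hat > threshold, "Red", "Green"))    # majority vote, adjusted threshold
ifelse(mean(p_hat) > threshold, "Red", "Green")     # average probability, adjusted threshold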

Solution

Solution here

Problem 3: Random Forest Tuning

Random forest has several tuning parameters that you will explore in this problem. We will use the ames housing data from the modeldata R package.

There are several R packages for Random Forest. The ranger::ranger() function is much faster than randomForest::randomForest(), so we will use it.
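Since modeldata is loaded above, the ames data frame is available directly; a quick check before modeling:

dim(ames)                 # rows and columns of the Ames housing data
summary(ames$Sale_Price)  # the response variable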

a. Random forest (ranger) tuning parameters

List all of the random forest tuning parameters in the ranger::ranger() function. You don’t need to list the parameters related to computation, special models (e.g., survival, maxstat), or anything that won’t impact the predictive performance.

Indicate which tuning parameters you think will be most important to optimize.

Solution

Solution here

b. Implement Random Forest

Use a random forest model to predict the sale price, Sale_Price. Use the default parameters and report the 10-fold cross-validation RMSE (root mean squared error).
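A minimal sketch of 10-fold cross-validation with the default ranger() settings (the seed, fold assignment, and object names are illustrative choices):

set.seed(2024)                              # illustrative seed for reproducibility
n = nrow(ames)
folds = sample(rep(1:10, length.out = n))   # random fold assignment

sq_err = numeric(n)
for (k in 1:10) {
  test = which(folds == k)
  fit  = ranger(Sale_Price ~ ., data = ames[-test, ])   # default parameters
  pred = predict(fit, data = ames[test, ])$predictions
  sq_err[test] = (ames$Sale_Price[test] - pred)^2
}
sqrt(mean(sq_err))                          # 10-fold CV RMSE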

Solution

Solution here

c. Random Forest Tuning

Now we will vary the tuning parameters mtry and min.bucket to see what effect they have on performance.

  • Use a range of reasonable mtry and min.bucket values.
    • The valid mtry values are \(\{1,2, \ldots, p\}\) where \(p\) is the number of predictor variables. However, the default value of mtry = sqrt(p) = 8 is usually close to optimal, so you may want to focus your search around that value.
    • The default min.bucket=1 will grow deep trees, which usually works best if there are enough trees. But try some larger values and see how they impact predictive performance.
    • Set num.trees=1000, which is larger than the default of 500.
  • Use 5-times-repeated out-of-bag (OOB) error to assess performance. That is, run the random forest 5 times for each tuning-parameter combination, calculate the OOB MSE each time, and use the average as the MSE associated with that combination.
  • Use a plot to show the average MSE as a function of mtry and min.bucket.
  • Report the best tuning parameter combination.
  • Note: random forest is a stochastic model; its results will differ every time it runs due to the bootstrap sampling and the random selection of features considered for splitting. Set the random seed so your results are reproducible.
  • Hint: If you use the ranger package, the prediction.error element in the output is the OOB MSE. A minimal sketch of this workflow is given after this list.
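A minimal sketch of the repeated-OOB tuning loop (the grid values and seed are illustrative choices; the plot uses ggplot2, loaded above with the tidyverse):

set.seed(2024)                                          # illustrative seed
tune_grid = expand.grid(mtry = c(4, 6, 8, 10, 14, 20),  # values near the default of 8
                        min.bucket = c(1, 5, 10, 25))
tune_grid$avg_mse = NA

for (i in seq_len(nrow(tune_grid))) {
  mse = numeric(5)
  for (r in 1:5) {                                      # 5 repeats per combination
    fit = ranger(Sale_Price ~ ., data = ames,
                 num.trees = 1000,
                 mtry = tune_grid$mtry[i],
                 min.bucket = tune_grid$min.bucket[i])
    mse[r] = fit$prediction.error                       # OOB MSE
  }
  tune_grid$avg_mse[i] = mean(mse)
}

# average OOB MSE as a function of mtry and min.bucket
ggplot(tune_grid, aes(mtry, avg_mse, color = factor(min.bucket))) +
  geom_line() + geom_point() +
  labs(y = "average OOB MSE", color = "min.bucket")

tune_grid[which.min(tune_grid$avg_mse), ]               # best tuning combination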
Solution

Solution here