DS 6030 | Fall 2024 | University of Virginia
Homework #9: Feature Importance
Required R packages and Directories
dir_data = 'https://mdporter.github.io/teaching/data/' # data directory
library(tidyverse) # functions for data manipulation
Problem 1: Permutation Feature Importance
Vanderbilt Biostats has collected data on Titanic survivors (https://hbiostat.org/data/). I have done some simple processing and split the data into training and test sets.
We are going to use this data to investigate feature importance. Use Class, Sex, Age, Fare, sibsp (number of siblings or spouse on board), parch (number of parents or children on board), and Joined (city where passenger boarded) for the predictor variables (features) and Survived as the outcome variable.
a. Load the Titanic training and testing data
b. Method 1: Built-in importance scores
Fit a tree ensemble model (e.g., Random Forest, boosted tree) on the training data. You are free to use any method to select the tuning parameters.
Report the built-in feature importance scores and produce a barplot with feature on the x-axis and importance on the y-axis.
c. Performance
Report the performance of the model fit in part (b.) on the test data. Use the log-loss (where \(M\) is the size of the test data): \[ \text{log-loss}(\hat{p}) = - \frac{1}{M} \sum_{i=1}^M [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \]
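The log-loss formula can be written as a short R helper. This is a minimal sketch, not a required implementation: the function name `log_loss` and the `eps` clipping (which keeps the logarithms finite when a predicted probability is exactly 0 or 1) are assumptions added for illustration. It expects `y` coded as 0/1 and `p_hat` as the predicted probability that `y` is 1.

```r
# Mean log-loss for a binary outcome.
# y:     vector of 0/1 outcomes
# p_hat: vector of predicted probabilities P(y = 1)
# eps:   clipping bound (an assumption, not part of the formula) to avoid log(0)
log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Uninformative predictions of 0.5 give a loss of log(2)
log_loss(c(1, 0), c(0.5, 0.5))
```

Averaging with `mean()` supplies the \(1/M\) factor, so the result is comparable across test sets of different sizes.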
d. Method 2: Permute after fitting
Use the fitted model from part (b.) to perform permutation feature importance. Shuffle/permute each variable individually on the test set before making predictions. Record the loss. Repeat \(M=10\) times and produce a boxplot of the change in loss (relative to the loss reported in part c.).
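The permute-after-fitting loop might look like the sketch below. To keep it self-contained and runnable in base R, it uses simulated data and a logistic regression as stand-ins for the Titanic data and the tree ensemble. The names `simulate`, `predict_prob`, `perm_after`, `x1`, and `x2` are all hypothetical. With a real ensemble, only the fitting call and `predict_prob` would need to change.

```r
set.seed(1)

# Simulated stand-in data (NOT the Titanic data): x1 drives the outcome, x2 is noise
simulate <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- rbinom(n, 1, plogis(2 * d$x1))
  d
}
train <- simulate(500)
test  <- simulate(500)

log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Stand-in model: logistic regression in place of a tree ensemble
fit <- glm(y ~ x1 + x2, data = train, family = binomial)
predict_prob <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

# Loss on the unaltered test set, used as the reference point
base_loss <- log_loss(test$y, predict_prob(fit, test))

features <- c("x1", "x2")
n_rep <- 10
perm_after <- expand.grid(rep = seq_len(n_rep), feature = features,
                          stringsAsFactors = FALSE)
perm_after$delta_loss <- vapply(seq_len(nrow(perm_after)), function(i) {
  shuffled <- test
  f <- perm_after$feature[i]
  shuffled[[f]] <- sample(shuffled[[f]])  # permute one column; the model is not refit
  log_loss(test$y, predict_prob(fit, shuffled)) - base_loss
}, numeric(1))

boxplot(delta_loss ~ feature, data = perm_after,
        xlab = "Permuted feature", ylab = "Change in log-loss")
```

Only the permuted column changes on each repetition. The outcome and the other features stay in place, so a large increase in loss indicates that the fitted model relies on that feature.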
e. Method 3: Permute before fitting
For this approach, shuffle/permute one predictor variable at a time in the training data and re-fit the ensemble model. Evaluate the predictions on the (unaltered) test data. Repeat \(M=10\) times (for each predictor variable) and produce a boxplot of the change in loss.
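A sketch of the permute-before-fitting variant follows, again with simulated data and a logistic regression standing in for the Titanic data and the tree ensemble. The names `simulate`, `fit_model`, `predict_prob`, and `perm_before` are hypothetical. The shuffle is applied to the training data and the model is refit on every repetition, so this approach costs one model fit per feature per repetition.

```r
set.seed(2)

# Simulated stand-in data (NOT the Titanic data): x1 drives the outcome, x2 is noise
simulate <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- rbinom(n, 1, plogis(2 * d$x1))
  d
}
train <- simulate(500)
test  <- simulate(500)

log_loss <- function(y, p_hat, eps = 1e-15) {
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Stand-in fitting routine: logistic regression in place of a tree ensemble
fit_model <- function(data) glm(y ~ x1 + x2, data = data, family = binomial)
predict_prob <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

# Reference loss: model fit on the unaltered training data
base_loss <- log_loss(test$y, predict_prob(fit_model(train), test))

features <- c("x1", "x2")
n_rep <- 10
perm_before <- expand.grid(rep = seq_len(n_rep), feature = features,
                           stringsAsFactors = FALSE)
perm_before$delta_loss <- vapply(seq_len(nrow(perm_before)), function(i) {
  shuffled <- train
  f <- perm_before$feature[i]
  shuffled[[f]] <- sample(shuffled[[f]])  # permute one training column
  refit <- fit_model(shuffled)            # refit on the altered training data
  log_loss(test$y, predict_prob(refit, test)) - base_loss  # test data is unaltered
}, numeric(1))

boxplot(delta_loss ~ feature, data = perm_before,
        xlab = "Permuted feature", ylab = "Change in log-loss")
```

Because the model is refit, it can compensate for a shuffled feature by leaning on correlated predictors. The change in loss here therefore measures how much the feature adds beyond the others, rather than how much one particular fitted model depends on it.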
f. Understanding
Describe the benefits of each of the three approaches to measure feature importance.