SYS 6018 | Spring 2024 | University of Virginia

Homework #8: Feature Engineering and Importance

Author

Your Name Here

Published

April 1, 2024

Required R packages and Directories

dir_data = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(tidyverse) # functions for data manipulation   

Problem 1: Permutation Feature Importance

Vanderbilt Biostats has collected data on Titanic survivors (https://hbiostat.org/data/). I have done some simple processing and split into a training and test sets.

We are going to use this data to investigate feature importance. Use Class, Sex, Age, Fare, sibsp (number of siblings or spouse on board), parch (number of parents or children on board), and Joined (city where passenger boarded) for the predictor variables (features) and Survived as the outcome variable.

a. Method 1: Built-in importance scores

Fit a tree ensemble model (e.g., Random Forest, boosted tree) on the training data. You are free to use any method to select the tuning parameters.

Report the built-in feature importance scores and produce a barplot with feature on the x-axis and importance on the y-axis.

Solution

Add solution here

b. Performance

Report the performance of the model fit from (a.) on the test data. Use the log-loss (where \(M\) is the size of the test data): \[ \text{log-loss}(\hat{p}) = - \frac{1}{M} \sum_{i=1}^m [y_i \log \, \hat{p}_i + (1 - y_i) \log \, (1 - \hat{p}_i)] \]

Solution

Add solution here

c. Method 2: Permute after fitting

Use the fitted model from question (a.) to perform permutation feature importance. Shuffle/permute each variable individually on the test set before making predictions. Record the loss. Repeat \(M=10\) times and produce a boxplot of the change in loss (change from reported loss from part b.).

Solution

Add solution here

d. Method 3: Permute before fitting

For this approach, shuffle/permute the training data and re-fit the ensemble model. Evaluate the predictions on the (unaltered) test data. Repeat \(M=10\) times (for each predictor variable) and produce a boxplot of the change in loss.

Solution

Add solution here

e. Understanding

Describe the benefits of each of the three approaches to measure feature importance.

Solution

Add solution here

Problem 2: Effects of correlated predictors

This problem will illustrate what happens to the importance scores when there are highly associated predictors.

a. Create an almost duplicate feature

Create a new feature Sex2 that is 95% the same as Sex. Do this by selecting 5% of training (\(n=50\)) and testing (\(n=15\)) data and flip the Sex value.

Solution

Add solution here

b. Method 1: Built-in importance

Fit the same model as in Problem 1a, but use the new data that includes Sex2 (i.e., use both Sex and Sex2 in the model). Calculate the built-in feature importance score and produce a barplot.

Solution

Add solution here

c. Method 2: Permute after fitting

Redo Method 2 (problem 1c) on the new data/model and produce a boxplot of importance scores. The importance score is defined as the difference in loss.

Solution

Add solution here

d. Method 3: Permute before fitting

Redo Method 3 (problem 1d) on the new data and produce a boxplot of importance scores. The importance score is defined as the difference in loss.

Solution

Add solution here

e. Understanding

Describe how the addition of the almost duplicated predictor impacted the feature importance results.

Solution

Add solution here