DS 6030 | Spring 2026 | University of Virginia
Homework #3: Trees and Random Forest
Problem 1: Prediction Contest
This problem uses the realestate-train and realestate-test datasets (click the links for the data).
The goal of this contest is to predict the sale price in thousands (the price column) using a Random Forest model. Evaluation on the test data will be based on the mean squared error over the \(m\) test set observations \[
{\text{MSE}} = \frac{1}{m}\sum_{i=1}^m (y_i - \hat{y}_i)^2
\]
a. Load and pre-process data
Load the data and create necessary data structures for running Random Forest.
- There are some categorical/nominal features. You decide the best way to handle them. Some implementations accept categorical data directly (R's ranger, randomForest) while others (scikit-learn) do not; one approach is sketched after this list.
- For this problem, you are free to use any data transformations or feature engineering.
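A minimal sketch of one way to do this in R, assuming the files were saved as realestate-train.csv and realestate-test.csv (the file names are an assumption): character columns are converted to factors so that ranger or randomForest can use them without dummy coding.

```r
# Load the data; file names are assumed -- adjust to wherever the linked files were saved.
train <- read.csv("realestate-train.csv", stringsAsFactors = FALSE)
test  <- read.csv("realestate-test.csv",  stringsAsFactors = FALSE)

# Convert character columns to factors so implementations that accept
# categorical data (e.g., ranger, randomForest) can use them directly.
library(dplyr)
train <- train %>% mutate(across(where(is.character), as.factor))
test  <- test  %>% mutate(across(where(is.character), as.factor))

str(train)  # check column types before modeling
```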
b. Training
Train a Random Forest model to predict the price of the test data.
- You are free to use any data transformation or feature engineering.
- You are free to use any tuning parameters.
- Report the tuning parameters you used to make your final predictions. Be sure to report any default parameters even if you didn’t tune them.
- Describe how you chose those tuning parameters; one possible tuning workflow is sketched after this list.
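As one illustration of how the tuning might be done, here is a minimal sketch in R using ranger: candidate values of mtry and min.node.size (illustrative values, not recommendations) are compared by out-of-bag (OOB) MSE, and the best pair is refit with more trees.

```r
library(ranger)

# Candidate tuning values (illustrative; mtry must not exceed the number of predictors)
grid <- expand.grid(mtry = 2:6, min.node.size = c(1, 5, 10))
grid$oob_mse <- NA_real_

for (i in seq_len(nrow(grid))) {
  fit <- ranger(price ~ ., data = train,
                num.trees = 500,
                mtry = grid$mtry[i],
                min.node.size = grid$min.node.size[i],
                seed = 2026)
  grid$oob_mse[i] <- fit$prediction.error  # OOB MSE for regression forests
}

best <- grid[which.min(grid$oob_mse), ]

# Refit at the selected tuning values with more trees
final_fit <- ranger(price ~ ., data = train,
                    num.trees = 1000,
                    mtry = best$mtry,
                    min.node.size = best$min.node.size,
                    seed = 2026)
```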
c. Submit predictions
Submit a .csv file named lastname_firstname.csv (comma separated, no extra spaces) containing your predictions. The file must include one column named yhat, with one prediction per row in the same order as the test data; a sketch for writing a file in this format follows the notes below. Submissions will be evaluated using an automated grader:
- Files that do not follow the required format exactly may not be graded and will lose up to 1 point.
- The top three scores from each section will receive an additional 0.5 bonus points.
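A minimal sketch for producing the file in the required format, assuming the final_fit and test objects from the sketches above:

```r
# Predict the test set and write a single-column csv named lastname_firstname.csv
yhat <- predict(final_fit, data = test)$predictions
write.csv(data.frame(yhat = yhat), "lastname_firstname.csv", row.names = FALSE)
```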
d. Report anticipated performance
Report the anticipated mean squared error (MSE) of your final model on the test data. Provide a point estimate and a 92% confidence interval for the anticipated MSE.
Your goal is to provide an honest assessment of out-of-sample performance and uncertainty. After grading, you will compare your reported interval to the actual test MSE to assess the calibration of your performance estimates.
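One way (among several) to produce such an estimate is sketched below: 10-fold cross-validation on the training data gives fold-level MSEs, whose mean is the point estimate and whose spread gives a t-based 92% interval. This reuses the train data frame and the tuned values in best from the earlier sketches, and it treats the fold MSEs as roughly independent, which is a simplification.

```r
set.seed(2026)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(train)))  # random fold assignment
fold_mse <- numeric(K)

for (k in 1:K) {
  fit_k <- ranger(price ~ ., data = train[folds != k, ],
                  num.trees = 500,
                  mtry = best$mtry,
                  min.node.size = best$min.node.size)
  pred_k <- predict(fit_k, data = train[folds == k, ])$predictions
  fold_mse[k] <- mean((train$price[folds == k] - pred_k)^2)
}

mse_hat <- mean(fold_mse)             # point estimate of out-of-sample MSE
se_hat  <- sd(fold_mse) / sqrt(K)     # standard error of the mean fold MSE
ci_92   <- mse_hat + c(-1, 1) * qt(0.96, df = K - 1) * se_hat  # 92% interval
```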
Problem 2: Tree Splitting (from scratch)
Implement a one-split prediction tree (a stump) by explicitly computing the gain across all possible split points for each combination of predictor and outcome types.
For each part below, find the optimal split and report the split and its gain.
You are not expected to build a full recursive tree. You do not need to optimize for speed.
You may use R or Python, but do not use any tree or forest libraries to perform the splitting.
a. Numeric Predictor (x_num), Numeric Outcome (y_num)
Report the split and gain using Sum of Squared Errors (SSE).
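A minimal sketch of the search in R, assuming x_num and y_num are the supplied vectors: the impurity is the node SSE, candidate splits are midpoints between consecutive unique predictor values, and the gain is the parent SSE minus the summed SSE of the two children.

```r
sse <- function(y) sum((y - mean(y))^2)  # node impurity: sum of squared errors

best_split_sse <- function(x, y) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between unique values
  gains <- sapply(splits, function(s) {
    sse(y) - (sse(y[x <= s]) + sse(y[x > s]))  # gain = parent SSE - children SSE
  })
  list(split = splits[which.max(gains)], gain = max(gains))
}

# best_split_sse(x_num, y_num)
```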
b. Numeric Predictor (x_num), Categorical Outcome (y_cat)
Report the split and gain using Cross-Entropy.
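The same search over midpoints applies; only the impurity function changes. A minimal sketch, assuming x_num and y_cat are the supplied vectors and cross-entropy is computed from the class proportions and weighted by node size:

```r
cross_entropy <- function(y) {
  p <- table(y) / length(y)      # class proportions in the node
  p <- p[p > 0]                  # drop empty classes to avoid log(0)
  -length(y) * sum(p * log(p))   # node-size-weighted cross-entropy
}

best_split_entropy <- function(x, y) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2
  gains <- sapply(splits, function(s) {
    cross_entropy(y) - (cross_entropy(y[x <= s]) + cross_entropy(y[x > s]))
  })
  list(split = splits[which.max(gains)], gain = max(gains))
}

# best_split_entropy(x_num, y_cat)
```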
c. Categorical Predictor (x_cat), Numeric Outcome (y_num)
Report the split and gain using Sum of Squared Errors (SSE).
d. Categorical Predictor (x_cat), Categorical Outcome (y_cat)
Report the split and gain using Cross-Entropy.
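For a categorical predictor, a split sends a subset of levels to one child and the remaining levels to the other, so candidate splits are all non-empty proper subsets of the levels. A minimal sketch covering parts c and d, assuming x_cat has few enough levels to enumerate (at least two) and reusing the sse and cross_entropy helpers defined above:

```r
best_split_categorical <- function(x, y, impurity) {
  levs <- unique(as.character(x))
  L <- length(levs)                       # assumes L >= 2
  best <- list(split = NULL, gain = -Inf)
  # Binary masks 1..(2^(L-1) - 1): the last level always stays right,
  # so each bipartition of the levels is visited exactly once.
  for (mask in 1:(2^(L - 1) - 1)) {
    left_levs <- levs[as.logical(bitwAnd(mask, 2^(0:(L - 1))))]
    in_left <- as.character(x) %in% left_levs
    gain <- impurity(y) - (impurity(y[in_left]) + impurity(y[!in_left]))
    if (gain > best$gain) best <- list(split = left_levs, gain = gain)
  }
  best
}

# best_split_categorical(x_cat, y_num, sse)            # part c: SSE gain
# best_split_categorical(x_cat, y_cat, cross_entropy)  # part d: cross-entropy gain
```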
Problem 3: Random Forest Tuning
The goal of this problem is to compare different strategies for tuning Random Forests and estimating prediction error. You will tune tree complexity and feature subsampling, and evaluate how out-of-bag (OOB) error and cross-validation (CV) behave in practice.
This problem is not assigned.