= "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
data_url library(tidyverse)
DS 6030 | Fall 2024 | University of Virginia
Homework #8: Boosting
This is an independent assignment. Do not discuss or work with classmates.
Required R packages and Directories
Problem 1: Bike Sharing Data
This homework will work with bike rental data from Washington D.C.
a. Load data
Load the hourly Bikesharing data from the UCI ML Repository.
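One way to load the data, a sketch assuming the `data_url` defined above and that the archive contains the hourly file `hour.csv`:

```r
library(readr)

# Download the zip to a temporary file, then read the hourly data from it
zipfile = tempfile(fileext = ".zip")
download.file(data_url, zipfile, mode = "wb")
unzip(zipfile, files = "hour.csv", exdir = tempdir())
bikes = read_csv(file.path(tempdir(), "hour.csv"))
```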
b. Data Cleaning
Check out the variable descriptions in the Additional Variable Information. To prepare the data for modeling, do the following:
- Convert `weathersit` to an ordered factor.
- Unnormalize `temp` and `atemp` and convert to Fahrenheit.
- Unnormalize `windspeed`.
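A sketch of these cleaning steps, assuming the raw hourly data is in a data frame `bikes`; the scaling constants (temp divided by 41, atemp by 50, windspeed by 67) come from the UCI variable descriptions:

```r
library(dplyr)

bikes_clean = bikes %>%
  mutate(
    weathersit = factor(weathersit, levels = 1:4, ordered = TRUE),
    temp  = (temp  * 41) * 9/5 + 32,   # unnormalize (Celsius), then to Fahrenheit
    atemp = (atemp * 50) * 9/5 + 32,
    windspeed = windspeed * 67
  )
```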
c. Missing times
Not every hour of every day is represented in these data. Some missing times, like 2011-03-15 hr=3, are due to daylight saving time. Other times, like 2011-01-02 hr=5, are probably due to the data collection process, which ignored any times when `cnt = 0`.
This may not be perfect, but do the following to account for missing times:
- Create new rows/observations for all missing date-hr combinations that we think are due to actual zero counts; that is, exclude daylight saving times. Set the outcome variables to zero (`casual = 0`, `registered = 0`, and `cnt = 0`) for these new observations. `tidyr::complete()` can help.
- Fill in the other missing feature values with the values from the previous hour. For example, the `temp` for 2011-01-02 hr=5 should be set to the `temp` from the non-missing 2011-01-02 hr=4. `tidyr::fill()` can help.
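One possible sketch of these two steps, assuming the cleaned data `bikes_clean` has a `Date` column `dteday` and an integer `hr`, and that `dst_hours` is a hypothetical data frame of the daylight-saving date-hr pairs to exclude:

```r
library(dplyr)
library(tidyr)

bikes_full = bikes_clean %>%
  # Step 1: add a row for every date-hr combination, with zero counts
  complete(dteday = seq(min(dteday), max(dteday), by = "day"), hr = 0:23,
           fill = list(casual = 0, registered = 0, cnt = 0)) %>%
  anti_join(dst_hours, by = c("dteday", "hr")) %>%  # drop daylight saving hours
  # Step 2: carry remaining feature values forward from the previous hour
  arrange(dteday, hr) %>%
  fill(everything(), .direction = "down")
```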
d. New predictors
- Add the variable `doy` to represent the day of the year (1-366).
- Add the variable `days` to represent the fractional number of days since 2011-01-01. For example, hr=2 of 2011-01-02 is 1.083.
- Add lagged counts: autoregressive. Add the variable `cnt_ar` to be the `cnt` in the previous hour. You will need to set the value for `cnt_ar` for the 1st observation.
- Add lagged counts: same time previous day, or a lag of 24 hours. You will need to set the values for the first 24 hours.
Hints:
- The `lubridate` package (part of the `tidyverse`) is useful for dealing with dates and times.
- `dplyr::lag()` can help with making the lagged variables.
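A sketch of the new predictors, assuming the completed data `bikes_full` has a `Date` column `dteday`; filling the initial lag values with 0 is one arbitrary choice:

```r
library(dplyr)
library(lubridate)

bikes_feat = bikes_full %>%
  arrange(dteday, hr) %>%
  mutate(
    doy  = yday(dteday),                                      # day of year, 1-366
    days = as.numeric(dteday - as.Date("2011-01-01")) + hr/24,
    cnt_ar    = coalesce(lag(cnt, 1), 0),                     # previous hour
    cnt_lag24 = coalesce(lag(cnt, 24), 0)                     # same hour yesterday
  )
```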
e. Train-Test split
Randomly select 1000 observations for the test set and use the remaining for training.
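A minimal sketch of the split (the seed value is arbitrary):

```r
set.seed(2024)                              # arbitrary seed for reproducibility
test_idx = sample(nrow(bikes_feat), 1000)   # 1000 random rows for the test set
test  = bikes_feat[test_idx, ]
train = bikes_feat[-test_idx, ]
```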
Problem 2: Predicting bike rentals
a. Poisson loss
The outcome variables, number of renters, are counts (i.e., non-negative integers). For count data, the variance often scales with the expected count. One way to accommodate this is to model the counts as a Poisson distribution with rate \(\lambda_i = \lambda(x_i)\). In lightgbm, the “poisson” objective uses an ensemble of trees to model the log of the rate \(F(x) = \log \lambda(x)\). The poisson loss function (negative log likelihood) for prediction \(F_i = \log \lambda_i\) is \(\ell(y_i, F_i) = -y_iF_i + e^{F_i}\) where \(y_i\) is the count for observation \(i\) and \(F_i\) is the ensemble prediction.
- Given the current prediction \(\hat{F}_i\), what is the gradient and hessian for observation \(i\)?
- Page 12 of the Taylor Expansion notes shows that each new iteration of boosting attempts to find the tree that minimizes \(\sum_i w_i (z_i - \hat{f}(x_i))^2\). What are the values for \(w_i\) and \(z_i\) for the “poisson” objective (in terms of \(\hat{\lambda}_i\) or \(e^{\hat{F}_i}\)).
b. LightGBM Tuning
Tune a lightgbm model on the training data to predict the total number of renters (`cnt`). Do not use `registered` or `casual` as predictors!
Use the “poisson” objective; this is a good starting place for count data. This sets the loss function to the negative Poisson log-likelihood.
You need to tune at least two parameters: one related to the complexity of the trees (e.g., tree depth) and another related to the complexity of the ensemble (e.g., number of trees/iterations). See the LightGBM documentation on parameter tuning and its list of all parameters.
You are free to tune other parameters as well, just be cautious of how long you are willing to wait for results.
- List relevant tuning parameter values, even those left at their default values. Indicate which values are non-default (either through tuning or just selecting). You can get these from the `params` element of a fitted lightgbm model, e.g., `lgbm_fitted$params`.
- Indicate what method was used for tuning (e.g., type of cross-validation).
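One way to structure the tuning, a sketch assuming `X` is a numeric feature matrix built from the training data (without `registered` or `casual`) and `y = train$cnt`; the grid values and fold count are illustrative choices, with the number of iterations tuned implicitly by early stopping:

```r
library(lightgbm)

dtrain = lgb.Dataset(data = X, label = y)

# Grid over tree complexity (num_leaves) and learning rate
grid = expand.grid(num_leaves = c(15, 31, 63), learning_rate = c(0.05, 0.1))
grid$best_iter = NA; grid$best_score = NA

for (i in seq_len(nrow(grid))) {
  cv = lgb.cv(
    params = list(objective = "poisson",
                  num_leaves = grid$num_leaves[i],
                  learning_rate = grid$learning_rate[i]),
    data = dtrain, nrounds = 1000, nfold = 10,
    early_stopping_rounds = 50, verbose = -1
  )
  grid$best_iter[i]  = cv$best_iter
  grid$best_score[i] = cv$best_score
}
grid[which.min(grid$best_score), ]   # best combination by CV poisson loss
```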
c. Evaluation
Make predictions on the test data and evaluate. Report the point estimate and 95% confidence interval for the poisson log loss and the mean absolute error.
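A sketch of the evaluation, assuming `lam` holds the predicted rates on the test set (for the "poisson" objective, lightgbm's `predict()` returns rates \(\hat\lambda_i\), not log-rates, by default) and `y = test$cnt`. The per-observation loss matches the formula above with \(F_i = \log\lambda_i\); the 95% confidence intervals use a normal approximation to the mean:

```r
# Assumed inputs: lam = predicted rates on the test set, y = test$cnt
poisson_loss = -y * log(lam) + lam          # per-observation negative log-lik
abs_err = abs(y - lam)

# mean and normal-approximation 95% CI for each metric
ci95 = function(x) mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(length(x))
mean(poisson_loss); ci95(poisson_loss)
mean(abs_err);      ci95(abs_err)
```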