DS 6030 | Fall 2024 | University of Virginia

Homework #8: Boosting

Author

Your Name Here

Published

November 1, 2024

This is an independent assignment. Do not discuss or work with classmates.

Required R packages and Directories

data_url = "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
library(lightgbm)  # used for boosting in Problem 2
library(tidyverse)

Problem 1: Bike Sharing Data

This homework will work with bike rental data from Washington D.C.

a. Load data

Load the hourly bike sharing data from the UCI ML Repository.

Solution

Add solution here
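
One possible starting sketch (not a definitive solution), assuming the UCI zip contains hour.csv at the top level as the repository page describes:

tmp = tempfile(fileext = ".zip")
download.file(data_url, tmp, mode = "wb")
unzip(tmp, files = "hour.csv", exdir = tempdir())   # extract only the hourly file
bikes = read_csv(file.path(tempdir(), "hour.csv"))  # readr is loaded with tidyverse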

b. Data Cleaning

Check out the variable descriptions in the Additional Variable Information. To prepare the data for modeling, do the following:

  1. Convert weathersit to an ordered factor.
  2. Unnormalize temp and atemp and convert them to Fahrenheit.
  3. Unnormalize windspeed.

Solution

Add solution here
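
A minimal sketch, assuming the normalization constants given in the UCI variable descriptions (temp divided by 41, atemp by 50, windspeed by 67; temperatures in Celsius):

bikes = bikes %>%
  mutate(
    weathersit = factor(weathersit, levels = 1:4, ordered = TRUE),
    temp  = (temp * 41) * 9/5 + 32,    # unnormalize, then Celsius -> Fahrenheit
    atemp = (atemp * 50) * 9/5 + 32,
    windspeed = windspeed * 67
  )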

c. Missing times

Not every hour of every day is represented in these data. Some missing times, like 2011-03-15 hr=3, are due to daylight saving time. Others, like 2011-01-02 hr=5, are probably due to the data collection process, which ignored any times when cnt = 0.

This may not be perfect, but do the following to account for missing times:

  1. Create new rows/observations for all missing date-hr combinations that we think are due to actual zero counts; that is, exclude the daylight saving hours. Set the outcome variables to zero (casual = 0, registered = 0, and cnt = 0) for these new observations. tidyr::complete() can help.

  2. Fill in the other missing feature values with the values from the previous hour. For example, the temp for 2011-01-02 hr=5 should be set to the temp from the non-missing 2011-01-02 hr=4. tidyr::fill() can help.

Solution

Add solution here
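
A minimal sketch using tidyr::complete() and tidyr::fill(); the daylight saving hour dropped below is the example date from above and is an assumption -- use whichever hours are actually missing for that reason in your data:

bikes_full = bikes %>%
  complete(dteday, hr = 0:23,
           fill = list(casual = 0, registered = 0, cnt = 0)) %>%  # new rows get zero counts
  filter(!(dteday == as.Date("2011-03-15") & hr == 3)) %>%        # exclude DST hour (assumed date)
  arrange(dteday, hr) %>%
  fill(everything(), .direction = "down")  # carry remaining features forward from the previous hour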

d. New predictors

  1. Add the variable doy to represent the day of the year (1-366).
  2. Add the variable days to represent the fractional number of days since 2011-01-01. For example, hr=2 of 2011-01-02 is 1.083.
  3. Add lagged counts: autoregressive. Add the variable cnt_ar to be the cnt from the previous hour. You will need to set the value of cnt_ar for the 1st observation.
  4. Add lagged counts: same time previous day, i.e., a lag of 24 hours. You will need to set the values for the first 24 hours.

Hints:

  • The lubridate package (part of the tidyverse) is useful for dealing with dates and times.
  • dplyr::lag() can help with making the lagged variables.

Solution

Add solution here
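
A minimal sketch; the name cnt_lag24 for the 24-hour lag and the choice of 0 for the initial lag values are assumptions:

bikes_full = bikes_full %>%
  arrange(dteday, hr) %>%
  mutate(
    doy  = yday(dteday),                                          # lubridate
    days = as.numeric(dteday - as.Date("2011-01-01")) + hr / 24,  # e.g., 1 + 2/24 = 1.083
    cnt_ar    = lag(cnt, 1),    # previous hour
    cnt_lag24 = lag(cnt, 24)    # same hour, previous day
  ) %>%
  mutate(across(c(cnt_ar, cnt_lag24), ~ replace_na(.x, 0)))       # initial values set to 0 (assumed)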

e. Train-Test split

Randomly select 1000 observations for the test set and use the remaining for training.

Solution

Add solution here
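
A minimal sketch (the seed value is arbitrary):

set.seed(2024)
test_id = sample(nrow(bikes_full), size = 1000)
bikes_test  = bikes_full[test_id, ]
bikes_train = bikes_full[-test_id, ]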

Problem 2: Predicting bike rentals

a. Poisson loss

The outcome variables, the numbers of renters, are counts (i.e., non-negative integers). For count data, the variance often scales with the expected count. One way to accommodate this is to model the counts with a Poisson distribution with rate \(\lambda_i = \lambda(x_i)\). In lightgbm, the “poisson” objective uses an ensemble of trees to model the log of the rate, \(F(x) = \log \lambda(x)\). The Poisson loss function (negative log-likelihood) for prediction \(F_i = \log \lambda_i\) is \(\ell(y_i, F_i) = -y_iF_i + e^{F_i}\), where \(y_i\) is the count for observation \(i\) and \(F_i\) is the ensemble prediction.

  • Given the current prediction \(\hat{F}_i\), what are the gradient and Hessian for observation \(i\)?
  • Page 12 of the Taylor Expansion notes shows that each new iteration of boosting attempts to find the tree that minimizes \(\sum_i w_i (z_i - \hat{f}(x_i))^2\). What are the values of \(w_i\) and \(z_i\) for the “poisson” objective (in terms of \(\hat{\lambda}_i\) or \(e^{\hat{F}_i}\))?

Solution

Add solution here
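
As a sketch to check a derivation against (assuming the notes use the convention \(w_i = h_i\) and \(z_i = -g_i/h_i\)): differentiating \(\ell(y_i, F_i) = -y_iF_i + e^{F_i}\) with respect to \(F_i\) gives

\[ g_i = \left.\frac{\partial \ell}{\partial F}\right|_{F = \hat{F}_i} = -y_i + e^{\hat{F}_i}, \qquad h_i = \left.\frac{\partial^2 \ell}{\partial F^2}\right|_{F = \hat{F}_i} = e^{\hat{F}_i} = \hat{\lambda}_i, \]

which would give \(w_i = \hat{\lambda}_i\) and \(z_i = (y_i - \hat{\lambda}_i)/\hat{\lambda}_i\).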

b. LightGBM Tuning

Tune a lightgbm model on the training data to predict the total number of renters (cnt). Do not use registered or casual as predictors!

  • Use the “poisson” objective; this is a good starting place for count data. This sets the loss function to the negative Poisson log-likelihood.

  • You need to tune at least two parameters: one related to the complexity of the trees (e.g., tree depth) and another related to the complexity of the ensemble (e.g., number of trees/iterations). See the LightGBM documentation on parameter tuning and its list of all parameters.

  • You are free to tune other parameters as well, just be cautious of how long you are willing to wait for results.

  1. List relevant tuning parameter values, even those left at their default values. Indicate which values are non-default (either through tuning or just selecting). You can get these from the params element of a fitted lightgbm model, e.g., lgbm_fitted$params.

  2. Indicate what method was used for tuning (e.g., type of cross-validation).

Solution

Add solution here
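
A minimal tuning sketch built around lgb.cv(); the feature handling, parameter values, and seed are assumptions rather than recommendations:

library(lightgbm)

make_X = function(df) {   # helper (hypothetical name) to build a numeric feature matrix
  df %>%
    mutate(weathersit = as.integer(weathersit)) %>%            # lightgbm needs numeric columns
    select(-casual, -registered, -cnt, -dteday, -instant) %>%  # drop outcomes and identifiers
    as.matrix()
}
dtrain = lgb.Dataset(make_X(bikes_train), label = bikes_train$cnt)

params = list(objective = "poisson", learning_rate = 0.1,
              max_depth = 6, num_leaves = 31)   # assumed starting values

set.seed(2024)
cv = lgb.cv(params, dtrain, nrounds = 1000, nfold = 5,
            early_stopping_rounds = 25, verbose = -1)
cv$best_iter   # CV-selected number of boosting iterations

lgbm_fitted = lgb.train(params, dtrain, nrounds = cv$best_iter)
lgbm_fitted$params   # parameter values to report

Repeating the cross-validation over a small grid of max_depth (or num_leaves) values tunes the tree complexity, while best_iter with early stopping tunes the ensemble size.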

c. Evaluation

Make predictions on the test data and evaluate. Report the point estimate and 95% confidence interval for the Poisson log loss and the mean absolute error.

Solution

Add solution here
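
A minimal sketch, reusing the make_X() helper from above; predictions from a Poisson-objective booster are on the rate (count) scale, and the normal-approximation confidence interval over per-observation values is one simple choice (an assumption, not a required method):

lambda_hat = predict(lgbm_fitted, make_X(bikes_test))  # predicted rates

loss = lambda_hat - bikes_test$cnt * log(lambda_hat)   # per-observation Poisson log loss
ae   = abs(bikes_test$cnt - lambda_hat)                # per-observation absolute error

ci95 = function(x) mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(length(x))
mean(loss); ci95(loss)   # Poisson log loss: point estimate and 95% CI
mean(ae);   ci95(ae)     # MAE: point estimate and 95% CI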