= "https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip"
data_url library(tidyverse)
DS 6030 | Fall 2024 | University of Virginia
Homework #8: Boosting
This is an independent assignment. Do not discuss or work with classmates.
Required R packages and Directories
Problem 1: Bike Sharing Data
This homework will work with bike rental data from Washington D.C.
a. Load data
Load the hourly Bikesharing data from the UCI ML Repository.
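One way to load the data, a sketch assuming the `data_url` defined above and that the archive contains the hourly file `hour.csv`:

```r
library(readr)

# Download the zip to a temporary file, then read the hourly data from it
zipfile = tempfile(fileext = ".zip")
download.file(data_url, zipfile, mode = "wb")
unzip(zipfile, files = "hour.csv", exdir = tempdir())
bikes = read_csv(file.path(tempdir(), "hour.csv"))
```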
b. Data Cleaning
Check out the variable descriptions in the Additional Variable Information. To prepare the data for modeling, do the following:
- Convert `weathersit` to an ordered factor.
- Unnormalize `temp` and `atemp` and convert to Fahrenheit.
- Unnormalize `windspeed`.
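A sketch of these cleaning steps, assuming the raw hourly data is in a data frame `bikes`; the scaling constants (temp divided by 41, atemp by 50, windspeed by 67) come from the UCI variable descriptions:

```r
library(dplyr)

bikes_clean = bikes %>%
  mutate(
    weathersit = factor(weathersit, levels = 1:4, ordered = TRUE),
    temp  = (temp  * 41) * 9/5 + 32,   # unnormalize (Celsius), then to Fahrenheit
    atemp = (atemp * 50) * 9/5 + 32,
    windspeed = windspeed * 67
  )
```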
c. Missing times
Not every hour of every day is represented in these data. Some missing times, like 2011-03-15 hr=3, are due to daylight saving time. Other times, like 2011-01-02 hr=5, are probably due to the data collection process, which ignored any times when `cnt = 0`.
This may not be perfect, but do the following to account for missing times:
- Create new rows/observations for all missing date-hr combinations that we think are due to actual zero counts; that is, exclude daylight saving times. Set the outcome variables to zero (`casual = 0`, `registered = 0`, and `cnt = 0`) for these new observations. `tidyr::complete()` can help.
- Fill in the other missing feature values with the values from the previous hour. For example, the `temp` for 2011-01-02 hr=5 should be set to the `temp` from the non-missing 2011-01-02 hr=4. `tidyr::fill()` can help.
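One possible sketch of these two steps, assuming the cleaned data `bikes_clean` has a `Date` column `dteday` and an integer `hr`, and that `dst_hours` is a hypothetical data frame of the daylight-saving date-hr pairs to exclude:

```r
library(dplyr)
library(tidyr)

bikes_full = bikes_clean %>%
  # Step 1: add a row for every date-hr combination, with zero counts
  complete(dteday = seq(min(dteday), max(dteday), by = "day"), hr = 0:23,
           fill = list(casual = 0, registered = 0, cnt = 0)) %>%
  anti_join(dst_hours, by = c("dteday", "hr")) %>%  # drop daylight saving hours
  # Step 2: carry remaining feature values forward from the previous hour
  arrange(dteday, hr) %>%
  fill(everything(), .direction = "down")
```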
d. New predictors
- Add the variable `doy` to represent the day of the year (1-366).
- Add the variable `days` to represent the fractional number of days since 2011-01-01. For example, hr=2 of 2011-01-02 is 1.083.
- Add lagged counts: autoregressive. Add the variable `cnt_ar` to be the `cnt` in the previous hour. You will need to set the value for `cnt_ar` for the 1st observation.
- Add lagged counts: same time previous day, or a lag of 24 hours. You will need to set the values for the first 24 hours.
Hints:
- The `lubridate` package (part of the `tidyverse`) is useful for dealing with dates and times.
- `dplyr::lag()` can help with making the lagged variables.
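A sketch of the new predictors, assuming the completed data `bikes_full` has a `Date` column `dteday`; filling the initial lag values with 0 is one arbitrary choice:

```r
library(dplyr)
library(lubridate)

bikes_feat = bikes_full %>%
  arrange(dteday, hr) %>%
  mutate(
    doy  = yday(dteday),                                      # day of year, 1-366
    days = as.numeric(dteday - as.Date("2011-01-01")) + hr/24,
    cnt_ar    = coalesce(lag(cnt, 1), 0),                     # previous hour
    cnt_lag24 = coalesce(lag(cnt, 24), 0)                     # same hour yesterday
  )
```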
e. Train-Test split
Randomly select 1000 observations for the test set and use the remaining for training.
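A minimal sketch of the split (the seed value is arbitrary):

```r
set.seed(2024)                              # arbitrary seed for reproducibility
test_idx = sample(nrow(bikes_feat), 1000)   # 1000 random rows for the test set
test  = bikes_feat[test_idx, ]
train = bikes_feat[-test_idx, ]
```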
Problem 2: Predicting bike rentals
a. Poisson loss
The outcome variables, number of renters, are counts (i.e., non-negative integers). For count data, the variance often scales with the expected count. One way to accommodate this is to model the counts as a Poisson distribution with rate \(\lambda_i = \lambda(x_i)\). In lightgbm, the “poisson” objective uses an ensemble of trees to model the log of the rate \(F(x) = \log \lambda(x)\). The poisson loss function (negative log likelihood) for prediction \(F_i = \log \lambda_i\) is \(\ell(y_i, F_i) = -y_iF_i + e^{F_i}\) where \(y_i\) is the count for observation \(i\) and \(F_i\) is the ensemble prediction.
- Given the current prediction \(\hat{F}_i\), what is the gradient and hessian for observation \(i\)?
- Page 12 of the Taylor Expansion notes shows that each new iteration of boosting attempts to find the tree that minimizes \(\sum_i w_i (z_i - \hat{f}(x_i))^2\). What are the values for \(w_i\) and \(z_i\) for the “poisson” objective (in terms of \(\hat{\lambda}_i\) or \(e^{\hat{F}_i}\)).
b. LightGBM Tuning
Tune a lightgbm model on the training data to predict the total number of renters (`cnt`). Do not use `registered` or `casual` as predictors!
Use the “poisson” objective; this is a good starting place for count data. This sets the loss function to the negative Poisson log-likelihood.
You need to tune at least two parameters: one related to the complexity of the trees (e.g., tree depth) and another related to the complexity of the ensemble (e.g., number of trees/iterations). See the LightGBM documentation on parameter tuning and its list of all parameters.
You are free to tune other parameters as well, just be cautious of how long you are willing to wait for results.
- List relevant tuning parameter values, even those left at their default values. Indicate which values are non-default (either through tuning or just selecting). You can get these from the `params` element of a fitted lightgbm model, e.g., `lgbm_fitted$params`.
- Indicate what method was used for tuning (e.g., type of cross-validation).
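One way to structure the tuning, a sketch assuming `X` is a numeric feature matrix built from the training data (without `registered` or `casual`) and `y = train$cnt`; the grid values and fold count are illustrative choices, with the number of iterations tuned implicitly by early stopping:

```r
library(lightgbm)

dtrain = lgb.Dataset(data = X, label = y)

# Grid over tree complexity (num_leaves) and learning rate
grid = expand.grid(num_leaves = c(15, 31, 63), learning_rate = c(0.05, 0.1))
grid$best_iter = NA; grid$best_score = NA

for (i in seq_len(nrow(grid))) {
  cv = lgb.cv(
    params = list(objective = "poisson",
                  num_leaves = grid$num_leaves[i],
                  learning_rate = grid$learning_rate[i]),
    data = dtrain, nrounds = 1000, nfold = 10,
    early_stopping_rounds = 50, verbose = -1
  )
  grid$best_iter[i]  = cv$best_iter
  grid$best_score[i] = cv$best_score
}
grid[which.min(grid$best_score), ]   # best combination by CV poisson loss
```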
c. Evaluation
Make predictions on the test data and evaluate. Report the point estimate and 95% confidence interval for the poisson log loss and the mean absolute error.
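A sketch of the evaluation, assuming `lam` holds the predicted rates on the test set (for the "poisson" objective, lightgbm's `predict()` returns rates \(\hat\lambda_i\), not log-rates, by default) and `y = test$cnt`. The per-observation loss matches the formula above with \(F_i = \log\lambda_i\); the 95% confidence intervals use a normal approximation to the mean:

```r
# Assumed inputs: lam = predicted rates on the test set, y = test$cnt
poisson_loss = -y * log(lam) + lam          # per-observation negative log-lik
abs_err = abs(y - lam)

# mean and normal-approximation 95% CI for each metric
ci95 = function(x) mean(x) + c(-1.96, 1.96) * sd(x) / sqrt(length(x))
mean(poisson_loss); ci95(poisson_loss)
mean(abs_err);      ci95(abs_err)
```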