SYS 6018 | Spring 2024 | University of Virginia

Homework #10: Clustering

Author

Your Name Here

Published

April 16, 2024

Required R packages and Directories

data_dir = 'https://mdporter.github.io/SYS6018/data/' # data directory
library(mclust)    # for model-based clustering
library(mixtools)  # for Poisson mixture model
library(tidyverse) # functions for data manipulation   

Problem 1: Customer Segmentation with RFM (Recency, Frequency, and Monetary Value)

RFM analysis is an approach that some businesses use to understand their customers’ activity. At any point in time, a company can measure how recently a customer purchased a product (Recency), how many times they have purchased (Frequency), and how much they have spent (Monetary Value). There are many ad hoc attempts to segment/cluster customers based on their RFM scores (e.g., here is one that ranks customers on each dimension independently: https://joaocorreia.io/blog/rfm-analysis-increase-sales-by-segmenting-your-customers.html). In this problem you will use the clustering methods we covered in class to segment the customers.

The data for this problem can be found here: https://mdporter.github.io/SYS6018/data/RFM.csv. Cluster based on the Recency, Frequency, and Monetary Value columns.

Solution

Load Data Here.

a. Implement hierarchical clustering.

  • Describe any pre-processing steps you took (e.g., scaling, distance metric).
  • State the linkage method you used with justification.
  • Show the resulting dendrogram.
  • State the number of segments/clusters you used with justification.
  • Using your segmentation, are customers 1 and 100 in the same cluster?
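
To make the workflow concrete, here is a minimal sketch of the steps listed above. It runs on simulated data, not the RFM file: the data frame `rfm_sim`, its values, the log transform, Ward linkage, and `K = 3` are all illustrative assumptions, and your own choices should be justified from the real data.

```r
# Simulated stand-in for the RFM data (hypothetical values, NOT the real file)
set.seed(1)
n = 150
rfm_sim = data.frame(
  Recency   = c(rexp(100, rate = 1/30), rexp(50, rate = 1/200)),
  Frequency = c(rpois(100, lambda = 12), rpois(50, lambda = 2)) + 1,
  Monetary  = c(rlnorm(100, meanlog = 6, sdlog = 0.5),
                rlnorm(50, meanlog = 4, sdlog = 0.5))
)

# Pre-processing (one possible choice): log-transform skewed columns, then standardize
X = scale(log1p(as.matrix(rfm_sim)))

# Euclidean distance with Ward linkage (an assumption; other linkages are valid)
d  = dist(X, method = "euclidean")
hc = hclust(d, method = "ward.D2")

# Dendrogram
plot(hc, labels = FALSE, main = "Ward linkage on simulated RFM data")

# Cut the tree into K clusters (K = 3 is arbitrary here) and compare two rows
K  = 3
cl = cutree(hc, k = K)
cl[1] == cl[100]   # TRUE if rows 1 and 100 share a cluster
```

The same pattern (pre-process, `dist()`, `hclust()`, `cutree()`) should carry over to the real data once it is loaded.
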
Solution

Add solution here

b. Implement k-means.

  • Describe any pre-processing steps you took (e.g., scaling).
  • State the number of segments/clusters you used with justification.
  • Using your segmentation, are customers 1 and 100 in the same cluster?
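
As a rough illustration of these steps, the sketch below runs k-means on simulated data and plots the total within-cluster sum of squares against K, which is one common (but not the only) way to choose K. The data frame `rfm_sim`, the transform, and `K = 2` are assumptions made for the example only.

```r
# Simulated stand-in for the RFM data (hypothetical values, NOT the real file)
set.seed(1)
rfm_sim = data.frame(
  Recency   = c(rexp(100, rate = 1/30), rexp(50, rate = 1/200)),
  Frequency = c(rpois(100, lambda = 12), rpois(50, lambda = 2)) + 1,
  Monetary  = c(rlnorm(100, meanlog = 6, sdlog = 0.5),
                rlnorm(50, meanlog = 4, sdlog = 0.5))
)
X = scale(log1p(as.matrix(rfm_sim)))   # same pre-processing assumption as before

# Total within-cluster sum of squares for a grid of K values
K_grid = 1:8
wss = sapply(K_grid, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(K_grid, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# Fit with a chosen K (K = 2 is arbitrary here) and compare two rows
K  = 2
km = kmeans(X, centers = K, nstart = 25)
km$cluster[1] == km$cluster[100]   # TRUE if rows 1 and 100 share a cluster
```

Using `nstart` greater than 1 reruns the algorithm from several random starts and keeps the best solution, which reduces sensitivity to initialization.
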
Solution

Add solution here

c. Implement model-based clustering.

  • Describe any pre-processing steps you took (e.g., scaling).
  • State the number of segments/clusters you used with justification.
  • Describe the best model. What restrictions are on the shape of the components?
  • Using your segmentation, are customers 1 and 100 in the same cluster?
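
The sketch below shows where the requested quantities live in an `Mclust()` fit. It uses simulated data (the data frame `rfm_sim` and the range `G = 1:6` are assumptions for illustration), so the selected model here says nothing about the real RFM data.

```r
library(mclust)

# Simulated stand-in for the RFM data (hypothetical values, NOT the real file)
set.seed(1)
rfm_sim = data.frame(
  Recency   = c(rexp(100, rate = 1/30), rexp(50, rate = 1/200)),
  Frequency = c(rpois(100, lambda = 12), rpois(50, lambda = 2)) + 1,
  Monetary  = c(rlnorm(100, meanlog = 6, sdlog = 0.5),
                rlnorm(50, meanlog = 4, sdlog = 0.5))
)
X = scale(log1p(as.matrix(rfm_sim)))

# Fit Gaussian mixtures over several covariance structures and G = 1..6;
# Mclust() selects the combination with the best BIC
fit = Mclust(X, G = 1:6)
summary(fit)

fit$modelName   # covariance structure code (e.g., "VVV" is unrestricted)
fit$G           # selected number of components
plot(fit, what = "BIC")

# Hard assignments and comparison of two rows
fit$classification[1] == fit$classification[100]
```

The letters in `modelName` encode the restrictions on component volume, shape, and orientation; see `?mclustModelNames` for the full list.
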
Solution

Add solution here

d. Discussion of results

Discuss how you would cluster the customers if you had to do this for your job. Do you think one model would do better than the others?

Solution

Add solution here

Problem 2: Poisson Mixture Model

The pmf of a Poisson random variable is: \[ f_k(x; \lambda_k) = \frac{\lambda_k^x e^{-\lambda_k}}{x!} \]

A two-component Poisson mixture model can be written: \[ f(x; \theta) = \pi \frac{\lambda_1^x e^{-\lambda_1}}{x!} + (1-\pi) \frac{\lambda_2^x e^{-\lambda_2}}{x!} \]
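
The mixture pmf above translates directly into R using `dpois()`. The function name `dpoismix` and the parameter values below are hypothetical and chosen only for illustration.

```r
# Two-component Poisson mixture pmf:
# f(x) = p * Pois(x; lambda1) + (1 - p) * Pois(x; lambda2)
# (`p` plays the role of pi in the formula; renamed to avoid masking R's constant `pi`)
dpoismix = function(x, p, lambda1, lambda2) {
  p * dpois(x, lambda1) + (1 - p) * dpois(x, lambda2)
}

# Evaluate on a grid with illustrative parameter values
x_grid = 0:60
pmf = dpoismix(x_grid, p = 0.25, lambda1 = 8, lambda2 = 16)

sum(pmf)   # should be very close to 1 since the grid covers nearly all the mass
plot(x_grid, pmf, type = "h", xlab = "x", ylab = "f(x)")
```
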

a. Model parameters

What are the parameters of the model?

Solution

Add solution here

b. Log-likelihood

Write down the log-likelihood for \(n\) independent observations (\(x_1, x_2, \ldots, x_n\)).

Solution

\[ \log L(\theta) = \sum_{i=1}^n \log \left(\pi \frac{\lambda_1^{x_i} e^{-\lambda_1}}{x_i!} + (1-\pi) \frac{\lambda_2^{x_i} e^{-\lambda_2}}{x_i!} \right) \]
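
As a quick numerical check, the log-likelihood above can be evaluated in R. This is a sketch: the function name `loglik_poismix`, the ordering of `theta`, and the toy data are assumptions made for the example.

```r
# Log-likelihood of a two-component Poisson mixture
# theta = c(p, lambda1, lambda2), where p is the mixing weight (pi in the formula)
loglik_poismix = function(theta, x) {
  mix = theta[1] * dpois(x, theta[2]) + (1 - theta[1]) * dpois(x, theta[3])
  sum(log(mix))
}

# Toy data and illustrative parameter values
x  = c(3, 7, 12, 20)
ll = loglik_poismix(c(0.25, 8, 16), x)
ll
```

A function of this form can also be used to monitor an iterative fit, since the log-likelihood should not decrease from one EM iteration to the next.
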

c. Updating the responsibilities

Suppose we have initial values of the parameters. Write down the equation for updating the responsibilities.

Solution

Add solution here

d. Updating the model parameters

Suppose we have responsibilities, \(r_{ik}\) for all \(i=1, 2, \ldots, n\) and \(k=1,2\). Write down the equations for updating the parameters.

Solution

Add solution here

e. Fit a two-component Poisson mixture model

Fit a two-component Poisson mixture model to the data generated by the code below. Report the estimated parameter values and show a plot of the estimated mixture pmf.

#-- Run this code to generate the data
set.seed(123)             # set seed for reproducibility
n = 200                   # sample size
z = sample(1:2, size=n, replace=TRUE, prob=c(.25, .75)) # sample the latent class
theta = c(8, 16)          # true parameters
y = ifelse(z==1, rpois(n, lambda=theta[1]), rpois(n, lambda=theta[2]))
  • Note: The function poisregmixEM() in the R package mixtools is designed to estimate a mixture of Poisson regression models. We can still use it for pmf estimation by recasting the problem as an intercept-only regression: set the x argument (predictors) to x = rep(1, length(y)) and set addintercept = FALSE.
    • Look carefully at the output from this model. The outputs use different names/symbols than the ones used in the course notes. The beta values (regression coefficients) are on the log scale, so exponentiate them to get the Poisson means.
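
The call described in the note can be sketched as follows. The data-generation lines repeat the code above; the second seed and the object name `fit` are assumptions, and because the EM algorithm depends on its starting values, the component order and exact estimates may differ between runs.

```r
library(mixtools)

# Data generation, copied from the assignment code above
set.seed(123)
n = 200
z = sample(1:2, size = n, replace = TRUE, prob = c(.25, .75))
theta = c(8, 16)
y = ifelse(z == 1, rpois(n, lambda = theta[1]), rpois(n, lambda = theta[2]))

# Intercept-only "regression": a column of ones and no added intercept
set.seed(2024)   # arbitrary seed for the EM starting values
fit = poisregmixEM(y, x = rep(1, length(y)), k = 2, addintercept = FALSE)

fit$lambda      # in mixtools, lambda holds the mixing proportions
exp(fit$beta)   # beta is on the log scale; exp() gives the Poisson means
```

Note the naming clash: mixtools calls the mixing proportions `lambda`, whereas the formulas above use \(\lambda_k\) for the Poisson means.
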
Solution

Add solution here

f. Extra Credit (2 pts): EM from scratch

Write a function that estimates this two-component Poisson mixture model using the EM approach. Show that it gives the same result as part e.

  • Note: you are not permitted to copy code. Write everything from scratch and use comments to indicate how the code works (e.g., the E-step, M-step, initialization strategy, and convergence should be clear).
  • Cite any resources you consulted to help with the coding.
Solution

Add solution here