Instacart is an online grocery delivery company that competes with the likes of Amazon and Shipt. As at many successful companies these days, data drives a large part of its business decision making.

As a way to find good data scientists and get ideas from the community, Instacart has released some of its data. Here is a portion of the press release:

Curious about the food Americans eat? Look no further.

Instacart is excited to announce our first public dataset release, “The Instacart Online Grocery Shopping Dataset 2017”. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users.

For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

Wow! This can be a handy dataset for learning about and practicing many data mining concepts.

Instacart Data

order_id  user_id  product_id  product_name
1         112108   49302       Bulgarian Yogurt
1         112108   11109       Organic 4% Milk Fat Whole Milk Cottage Cheese
1         112108   10246       Organic Celery Hearts
1         112108   49683       Cucumber Kirby
1         112108   43633       Lightly Smoked Sardines in Olive Oil
1         112108   13176       Bag of Organic Bananas
1         112108   47209       Organic Hass Avocado
1         112108   22035       Organic Whole String Cheese
36        79431    39612       Grated Pecorino Romano Cheese
36        79431    19660       Spring Water
36        79431    49235       Organic Half & Half
36        79431    43086       Super Greens Salad
36        79431    46620       Cage Free Extra Large Grade AA Eggs
36        79431    34497       Prosciutto, Americano
36        79431    48679       Organic Garnet Sweet Potato (Yam)

Questions

  1. The press release shows a screenshot stating that customers who bought {Hass Avocado, Small} frequently bought {Red Vine Tomato} and {Yellow Onions, Loose}.
    1. Why would Instacart show this to a customer?
    2. How would you find other rules like this one (Avocado <–> Tomato and Onion)? Note: a system that suggests items this way is sometimes referred to as a recommender system.
    3. How do you think they define "frequently" in this situation?
  2. Besides frequent co-occurrence, what other associations between items do you think Instacart might be interested in?
    • How will your association(s) be affected if some of the items are rare?
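
As a starting point for these questions, here is a minimal sketch in base R of one common way to put numbers on "frequently": the support, confidence, and lift of a rule. This is an assumption for illustration, not necessarily how Instacart defines it. The baskets below are made up (they are not Instacart data), and the names `baskets`, `support`, `lhs`, and `rhs` are hypothetical.

```r
# Toy baskets invented for illustration; NOT taken from the Instacart data.
baskets <- list(
  c("avocado", "tomato", "onion"),
  c("avocado", "tomato"),
  c("avocado", "onion", "banana"),
  c("banana", "milk"),
  c("avocado", "tomato", "onion", "milk"),
  c("saffron", "avocado")
)

# Support: fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule: {avocado} -> {tomato, onion}
lhs <- "avocado"
rhs <- c("tomato", "onion")

supp_rule <- support(c(lhs, rhs), baskets)       # share of baskets with all three items (2 of 6)
conf_rule <- supp_rule / support(lhs, baskets)   # share of avocado baskets that also have tomato and onion
lift_rule <- conf_rule / support(rhs, baskets)   # confidence relative to how common tomato and onion are overall

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))

# A rare item: saffron appears in a single basket, which also contains avocado.
supp_rare <- support(c("saffron", "avocado"), baskets)
conf_rare <- supp_rare / support("saffron", baskets)  # equals 1, but rests on one basket
print(c(support = supp_rare, confidence = conf_rare))
```

In this toy example the rule {saffron} -> {avocado} has a confidence of 1 even though it is backed by only one basket, which is one reason a minimum support threshold is often used alongside confidence.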

Practice

  1. Download the data and load it into R.
  2. If the previous step succeeded, explore the data: calculate a summary statistic and make a visualization.
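
The two practice steps might look roughly like the following sketch in base R. To keep it self-contained, it writes the rows from the sample table to a temporary CSV file and reads that back; with the real download, `csv_path` (a hypothetical name) would instead point at wherever you saved the file, and the column names of the actual files may differ from the ones assumed here.

```r
# Rows copied from the sample table above; a stand-in for the real download.
sample_rows <- data.frame(
  order_id     = c(rep(1, 8), rep(36, 7)),
  user_id      = c(rep(112108, 8), rep(79431, 7)),
  product_id   = c(49302, 11109, 10246, 49683, 43633, 13176, 47209, 22035,
                   39612, 19660, 49235, 43086, 46620, 34497, 48679),
  product_name = c("Bulgarian Yogurt",
                   "Organic 4% Milk Fat Whole Milk Cottage Cheese",
                   "Organic Celery Hearts",
                   "Cucumber Kirby",
                   "Lightly Smoked Sardines in Olive Oil",
                   "Bag of Organic Bananas",
                   "Organic Hass Avocado",
                   "Organic Whole String Cheese",
                   "Grated Pecorino Romano Cheese",
                   "Spring Water",
                   "Organic Half & Half",
                   "Super Greens Salad",
                   "Cage Free Extra Large Grade AA Eggs",
                   "Prosciutto, Americano",
                   "Organic Garnet Sweet Potato (Yam)"),
  stringsAsFactors = FALSE
)

# Step 1: load a CSV file into R.
# The temporary file is only a placeholder for the downloaded data.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)
orders <- read.csv(csv_path, stringsAsFactors = FALSE)
str(orders)

# Step 2a: a summary statistic, the number of products in each order.
items_per_order <- table(orders$order_id)
print(items_per_order)
mean_items <- mean(as.vector(items_per_order))
print(mean_items)

# Step 2b: a simple visualization of the same counts.
barplot(items_per_order,
        xlab = "order_id",
        ylab = "Number of products",
        main = "Products per order (sample rows)")
```

On the full dataset, the same `table()` call would give the distribution of basket sizes, which a histogram would show better than a bar chart.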

Further Reading