Identifying Frequent Item Sets using Apache Beam/Dataflow
24 November ’17
I have used Google's serverless data warehouse service, BigQuery, for several of my projects this past year. I recently started familiarizing myself with Google's serverless data pipeline service, Dataflow. This post shows how to build a pipeline to identify frequently purchased item sets in market basket data from Instacart (3.2M orders, 32M total items purchased, 50K unique items purchased).
What is Apache Beam?
Apache Beam is a programming model for building and executing data processing pipelines. Pipelines are built using one of Beam's supported SDKs (Java, Python) and executed using one of Beam's supported runners (Spark, Flink, Dataflow on GCP). Pipelines themselves are typically just MapReduce-style operations: filtering, transforming, grouping + aggregating, etc.
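To make the "filter, transform, group + aggregate" shape concrete, here is a minimal sketch in plain Python rather than Beam itself. The rows are made up for illustration, and the comments only point at the Beam transforms that play a comparable role; this is not the pipeline from this post.

```python
from collections import Counter

# Hypothetical (order_id, item) rows; not taken from the Instacart data.
rows = [
    (1, "banana"), (1, "avocado"),
    (2, "banana"), (2, ""),        # blank item name, to be filtered out
    (3, "Banana "), (3, "milk"),   # messy item name, to be normalized
]

# Filter: drop rows with an empty item name (compare beam.Filter).
kept = [(order, item) for order, item in rows if item.strip()]

# Transform: normalize each item name (compare beam.Map).
normalized = [(order, item.strip().lower()) for order, item in kept]

# Group + aggregate: count rows per item (compare beam.CombinePerKey).
item_counts = Counter(item for _, item in normalized)

print(item_counts.most_common())
```

In a real pipeline each of these steps would run in parallel over partitions of the data, which is what the runner takes care of.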
Streaming data processing and batch data processing are often treated as distinctly different things. Streaming = unbounded, low latency, low accuracy. Batch = bounded, high latency, high accuracy. Nathan Marz's popular Lambda Architecture calls for both, a "correct" batch layer with historical data and an "approximate" real-time layer with recent data. However, the merits of this architecture are increasingly being challenged.
Beam is built on the idea that streaming data processing is really a superset of batch data processing. Beam has native features (e.g. windowing, watermarking, triggering, accumulation rules) to handle one of the biggest challenges of streaming data sources, the skew between event time and processing time. A common execution pattern in GCP is to use Pub/Sub to capture events and Dataflow to process those events and ingest into BigQuery, GCS, or back into Pub/Sub.
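The skew between event time and processing time is easier to see with a toy example. The sketch below is plain Python, not Beam's windowing API; the 60-second window size and the three events are assumptions made up for illustration. It groups the same events into fixed windows twice, once by when they happened and once by when they arrived.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size

# (event_time, processing_time, value), all in seconds.
events = [
    (5, 6, "a"),
    (70, 71, "b"),
    (50, 95, "c"),  # late: happened in the first minute, arrived in the second
]

def window_start(timestamp, size=WINDOW_SECONDS):
    """Start of the fixed window that contains the timestamp."""
    return timestamp - (timestamp % size)

by_event_time = defaultdict(list)
by_processing_time = defaultdict(list)
for event_time, processing_time, value in events:
    by_event_time[window_start(event_time)].append(value)
    by_processing_time[window_start(processing_time)].append(value)

print(dict(by_event_time))
print(dict(by_processing_time))
```

The late event "c" lands in different windows depending on which clock is used. Watermarks and triggers exist to decide how long to wait for events like it before emitting a result.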
This post will build a batch pipeline on a bounded data source, and, as such, will not showcase a lot of the great features Beam has for streaming data processing!
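I wanted to do an analysis to identify association rules, e.g. when purchasing A, also purchase B. I used the Apriori algorithm, which runs in two steps: (1) identify frequent item sets that meet a specified minimum support and (2) derive association rules that meet a specified minimum confidence. The support of an item set X is the number (or share) of orders that contain every item in X. The confidence of a rule X → Y is support(X ∪ Y) / support(X), i.e. the share of orders containing X that also contain Y.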
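As a small worked example of the two measures, the sketch below uses the standard definitions: support counts the baskets that contain every item in a set, and confidence of X → Y divides the support of X and Y together by the support of X. The four baskets and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical toy baskets, one set of items per order.
baskets = [
    {"banana", "avocado", "raspberries"},
    {"banana", "avocado"},
    {"banana", "milk"},
    {"avocado", "raspberries"},
]

def support(item_set, baskets):
    """Number of baskets that contain every item in item_set."""
    items = set(item_set)
    return sum(1 for basket in baskets if items <= basket)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent.

    Assumes the antecedent appears in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(support({"avocado", "banana"}, baskets))
print(confidence({"avocado"}, {"banana"}, baskets))
print(confidence({"raspberries"}, {"avocado"}, baskets))
```

Note that confidence is not symmetric in general: swapping antecedent and consequent changes the denominator.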
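Apriori looks for increasingly larger item sets composed of items from smaller frequent item sets. This works because of the monotonicity property, which states that if an item set is frequent then every subset of it is also frequent. Equivalently, if any subset is infrequent, the larger set cannot be frequent and never needs to be counted.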
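The level-wise search can be sketched in a few lines of single-machine Python. This is an in-memory illustration under assumed toy data, not the Beam pipeline from this post; the function name, the baskets, and the thresholds are all hypothetical. Candidates of size k are built only from frequent sets of size k-1, and any candidate with an infrequent subset is pruned before counting.

```python
from itertools import combinations

def frequent_item_sets(baskets, min_support, max_size=3):
    """Return {item_set: count} for item sets found in at least min_support baskets."""
    def count(candidates):
        counts = {c: 0 for c in candidates}
        for basket in baskets:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for basket in baskets for item in basket}
    level = count({frozenset([item]) for item in items})
    result = dict(level)
    for size in range(2, max_size + 1):
        # Join: union pairs of frequent (size-1)-sets that yield a set of this size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune: every (size-1)-subset must itself be frequent (monotonicity).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = count(candidates)
        if not level:
            break
        result.update(level)
    return result

# Hypothetical toy baskets.
toy_baskets = [
    {"banana", "avocado", "raspberries"},
    {"banana", "avocado", "raspberries"},
    {"banana", "avocado"},
    {"banana", "milk"},
    {"avocado", "raspberries"},
]
loose = frequent_item_sets(toy_baskets, min_support=2)
strict = frequent_item_sets(toy_baskets, min_support=3)
print(len(loose), len(strict))
```

With the stricter threshold, the pair banana + raspberries falls below the cutoff, so the triple containing it is pruned without ever being counted.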
Market Basket Analysis Results
I used 5K orders (~0.16% of total orders) as my minimum support cutoff. For simplicity, I only tested for frequent item sets of sizes 1-3. Dataflow ran the pipeline in ~16 minutes, autoscaling up to 3 nodes for the majority of the job. Here is a picture of my Dataflow pipeline, rotated for convenience:
I identified 1,094 frequent single items, 883 frequent item pairs, and 53 frequent item triples. From that, I derived 45 association rules. Here are the top 10 rules ranked in terms of confidence:
| Antecedent | Consequent | Support (orders) | Confidence |
|---|---|---|---|
| Organic Raspberries, Organic Hass Avocado | Bag of Organic Bananas | 11,409 | 44.2% |
| Non Fat Raspberry Yogurt | Icelandic Style Skyr Blueberry Non-fat Yogurt | 7,224 | 44.1% |
| Organic Large Extra Fancy Fuji Apple, Organic Hass Avocado | Bag of Organic Bananas | 5,804 | 43.5% |
| Apple Honeycrisp Organic, Organic Hass Avocado | Bag of Organic Bananas | 6,650 | 42.9% |
| Organic Avocado, Cucumber Kirby | Banana | 6,594 | 42.4% |
| Strawberries, Organic Avocado | Banana | 5,290 | 41.5% |
| Total 2% Lowfat Greek Strained Yogurt with Peach | Total 2% with Strawberry Lowfat Greek Strained Yogurt | 8,014 | 40.3% |
| Organic Whole Milk, Organic Avocado | Banana | 5,190 | 39.7% |
| Organic Cucumber, Organic Hass Avocado | Bag of Organic Bananas | 6,733 | 38.6% |
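Deriving rules from frequent item sets, the second Apriori step, needs only the support counts already computed. The sketch below assumes rules with a single-item consequent, which matches the shape of the rules listed above, and uses a hypothetical confidence threshold and made-up counts; the function name is also an assumption.

```python
def association_rules(support_counts, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples, best first.

    support_counts maps frozenset item sets to counts and must include
    every subset of each frequent set, as Apriori output does.
    """
    rules = []
    for item_set, both_count in support_counts.items():
        if len(item_set) < 2:
            continue
        for item in item_set:
            antecedent = item_set - {item}
            conf = both_count / support_counts[antecedent]
            if conf >= min_confidence:
                rules.append((antecedent, item, both_count, conf))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

# Hypothetical counts, not taken from the Instacart results.
counts = {
    frozenset({"A"}): 100,
    frozenset({"B"}): 50,
    frozenset({"A", "B"}): 40,
}
rules = association_rules(counts, min_confidence=0.5)
for antecedent, consequent, n, conf in rules:
    print(sorted(antecedent), "->", consequent, n, f"{conf:.1%}")
```

Here only B → A survives: 40 of the 50 orders with B also contain A, while only 40 of the 100 orders with A contain B.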
You can find the code for these pipelines in my repo here. Cheers!