A Blog About D4T4 & M47H

Model Stacking with Sklearn

10 December ’17

Stacking, also called meta ensembling, is a technique used to boost predictive accuracy by blending the predictions of multiple models. This technique is most effective when you have multiple, well-performing models which are not overly similar. Participants in Kaggle competitions will observe that winning solutions are often blends of multiple models, sometimes even models available in public notebooks! Kaggle grandmaster Triskelion has a nice write-up on using stacking in Kaggle competitions. Throwback: the winning solution to the Netflix challenge, from team BellKor's Pragmatic Chaos, used a blend of hundreds of different models.


I recently sought to implement a simple model stack in sklearn. The mlxtend package has a StackingClassifier for this. However, there was one big problem with this class: it does not allow you to train the meta classifier on out-of-sample predictions from the input models. This is a huge problem! Without out-of-sample predictions, overfit input models dominate the meta classifier's weights. I created my own class, leveraging the native FeatureUnion class to house the input models and cross_val_predict to generate out-of-sample predictions. For the meta classifier itself, I applied the logit function to the probabilities from the input models and fed them into a simple logistic regression.

import numpy as np

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

from scipy.special import logit

# decorator that drops the first column (P(class 0)) from `predict_proba` output
def chop_col0(function):
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)[:, 1:]
    return wrapper

# link `predict_proba` to `transform` so classifiers work inside FeatureUnion
def add_transform(classifiers):
    for key, classifier in classifiers:
        if isinstance(classifier, Pipeline):
            classifier = classifier.steps[-1][-1]
        classifier.transform = chop_col0(classifier.predict_proba)
        classifier.__class__.transform = chop_col0(classifier.__class__.predict_proba)
        # NOTE: need to add to the class too so `clone` in `cross_val_predict` works

# default meta classifier: applies logit to probabilities, then logistic regression
def default_meta_classifier():
    return Pipeline([
        ('logit', FunctionTransformer(lambda x: logit(np.clip(x, 0.001, 0.999)))),
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression()),
    ])

# stacking classifier
class StackingClassifier(Pipeline):

    def __init__(self, classifiers, meta_classifier=None, cv=3):
        self.classifiers = classifiers
        add_transform(self.classifiers)  # give each input model a `transform` method
        self.meta_classifier = meta_classifier if meta_classifier else default_meta_classifier() = cv
        self.steps = [('stack', FeatureUnion(self.classifiers)), ('meta', self.meta_classifier)]
        self.memory = None

    def add_dict_prefix(self, x, px):
        return {'%s__%s' % (px, k): v for k, v in x.items()}

    def set_params(self, **kwargs):
        return super(StackingClassifier, self).set_params(**self.add_dict_prefix(kwargs, 'stack'))

    def fit(self, X, y):
        # train the meta classifier on out-of-sample predictions
        meta_features = cross_val_predict(
            FeatureUnion(self.classifiers), X, y,, method='transform'
        ), y)
        # refit each input model on the full training set
        for name, classifier in self.classifiers:
  , y)
        return self

You can see this code implemented here. I built a model to predict which recently hospitalized diabetic patients will be re-hospitalized within 30 days, using this dataset from UCI. My model stack contained a logistic regression with regularization, a random forest, and a gradient boosting (xgboost) model. Here is a summary of model performance:

Model                    Score
LR+RF+XGB Model Stack    0.6990824552912449
LR+RF+XGB Average        0.6981398497127431
Random Forest            0.6952079165690574
Logistic Regression      0.684611003872049

As you can see, a simple average of the models outperforms any one model. And our model stack outperforms the simple average.

Another technique that I'd like to explore is feature-weighted linear stacking: tuning a meta model using interactions of meta features and input model predictions, the idea being that we can identify pockets of samples in which certain models perform best. More on this later!

Identifying Frequent Item Sets using Apache Beam/Dataflow

24 November ’17

I have used Google's serverless data warehouse service, BigQuery, for several of my projects this past year. I recently started familiarizing myself with Google's serverless data pipeline service, Dataflow. This post shows how to build a pipeline to identify frequently purchased item sets in market basket data from Instacart (3.2M orders, 32M total items purchased, 50K unique items purchased).

What is Apache Beam?

Apache Beam is a programming model for building and executing data processing pipelines. Pipelines are built using one of Beam's supported SDKs (Java, Python) and executed using one of Beam's supported runners (Spark, Flink, Dataflow on GCP). Pipelines themselves are typically just MapReduce operations: filtering, transforming, grouping + aggregating, etc.

Streaming data processing and batch data processing are often treated as distinctly different things. Streaming = unbounded, low latency, low accuracy. Batch = bounded, high latency, high accuracy. Nathan Marz's popular Lambda Architecture calls for both, a "correct" batch layer with historical data and an "approximate" real-time layer with recent data. However, the merits of this architecture are increasingly being challenged.

Beam is built on the idea that streaming data processing is really a superset of batch data processing. Beam has native features (e.g. windowing, watermarking, triggering, accumulation rules) to handle one of the biggest challenges of streaming data sources, the skew between event time and processing time. A common execution pattern in GCP is to use Pub/Sub to capture events and Dataflow to process those events and ingest into BigQuery, GCS, or back into Pub/Sub.

This post will build a batch pipeline on a bounded data source, and, as such, will not showcase a lot of the great features Beam has for streaming data processing!

Apriori Algorithm

I wanted to do an analysis to identify association rules, e.g. when purchasing A, also purchase B. I used the Apriori algorithm, which proceeds in two steps: (1) identify frequent item sets with a specified minimum support and (2) identify association rules with a specified minimum confidence. The support of an item set is the fraction (or count) of orders containing it; the confidence of a rule X → Y is support(X ∪ Y) / support(X).

Apriori looks for increasingly larger item sets comprised of items from smaller frequent item sets. This works because of the monotonicity property, which states that if an item set is frequent, then every subset of it is also frequent.


Market Basket Analysis Results

I used 5K orders (~0.1% of total orders) as my minimum support cutoff. For simplicity, I only tested for frequent item sets of sizes 1-3. Dataflow ran the pipeline in ~16 minutes, autoscaling up to 3 nodes for the majority of the job. Here is a picture of my Dataflow pipeline, rotated for convenience:

I identified 1,094 frequent single items, 883 frequent item pairs, and 53 frequent item triples. From that, I derived 45 association rules. Here are the top 10 rules ranked in terms of confidence:

Organic Raspberries, Organic Hass Avocado → Bag of Organic Bananas (support 11,409, confidence 44.2%)
Non Fat Raspberry Yogurt → Icelandic Style Skyr Blueberry Non-fat Yogurt (support 7,224, confidence 44.1%)
Organic Large Extra Fancy Fuji Apple, Organic Hass Avocado → Bag of Organic Bananas (support 5,804, confidence 43.5%)
Apple Honeycrisp Organic, Organic Hass Avocado → Bag of Organic Bananas (support 6,650, confidence 42.9%)
Organic Avocado, Cucumber Kirby → Banana (support 6,594, confidence 42.4%)
Strawberries, Organic Avocado → Banana (support 5,290, confidence 41.5%)
Total 2% Lowfat Greek Strained Yogurt with Peach → Total 2% with Strawberry Lowfat Greek Strained Yogurt (support 8,014, confidence 40.3%)
Organic Whole Milk, Organic Avocado → Banana (support 5,190, confidence 39.7%)
Bartlett Pears → Banana (support 13,682, confidence 38.6%)
Organic Cucumber, Organic Hass Avocado → Bag of Organic Bananas (support 6,733, confidence 38.6%)

You can find the code for these pipelines in my repo here. Cheers!

How to Deploy a Shiny App on Google Kubernetes Engine

02 October ’17

Shiny is an awesome tool for building interactive apps powered by R. There are a couple of options for deploying Shiny apps. You can deploy to RStudio's hosted service, You can also deploy on your own machine using the open source Shiny Server. This tutorial shows how to set up a Docker container for a Shiny app and deploy it on Google Kubernetes Engine. And because deploying the "Hello World" example is entirely unsatisfying, I chose to build an app to help guide strategy for a game I recently played.

Farkle - A Game of Guts & Luck Probability

My wife and I recently stumbled upon this game in one of our relatives' game closets. The game is very simple and is played with just 6 dice (how they get people to buy a game that consists of only 6 dice is beyond me). Different rolls are worth different amounts of points: 4 of a kind is worth 1000 points, 1-2-3-4-5-6 is worth 1500 points, three 6s is worth 600 points, etc. 1s and 5s not used in other combinations count as 100 and 50 points, respectively. Detailed scoring rules here. For instance, a roll of 1-3-3-4-5-3 = 300 + 100 + 50 = 450 points.

Any dice that don't count towards your score can be rolled again. However, and this is the catch, if you score no points on a roll, you lose all the points you've accumulated that turn. So in the above example, we could choose to roll 1 die (the non-scoring 4), but we will lose the 450 points we've banked if we don't roll a 1 or a 5 (the only scoring options with a single die).

Here is a summary of the expected number of points and the probability of scoring zero points on a single roll:

Dice Remaining | P(Points = 0) | E[Points]

As expected, the more dice we roll, the less likely we are to score zero points. Furthermore, since more high-scoring options become available with each additional die, each incremental die adds more expected points than the last.
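These quantities can be computed exactly by enumerating all 6^n rolls. The sketch below uses a simplified scorer (triples plus loose 1s and 5s only), so values for many dice will differ slightly from the full-rules table, but the one- and two-die cases are exact.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def score(roll):  # simplified scorer: triples + loose 1s and 5s
    counts = Counter(roll)
    pts = 0
    for face, n in counts.items():
        if n >= 3:
            pts += 1000 if face == 1 else face * 100
            n -= 3
        pts += n * (100 if face == 1 else 50 if face == 5 else 0)
    return pts

def p_zero(n_dice):
    """Exact probability of scoring nothing with n_dice dice."""
    rolls = list(product(range(1, 7), repeat=n_dice))
    return Fraction(sum(score(r) == 0 for r in rolls), len(rolls))

def e_points(n_dice):
    """Expected points from a single roll of n_dice dice."""
    rolls = list(product(range(1, 7), repeat=n_dice))
    return sum(score(r) for r in rolls) / len(rolls)
```

With one die, only a 1 or a 5 scores, so p_zero(1) is 4/6 and e_points(1) is (100 + 50) / 6 = 25.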

If you manage to score using all 6 of the dice, you get to start rolling again with all 6 dice. A short-sighted player may observe that and thus not be willing to risk rolling that last die. However, those 500 and 550 scenarios are too low! The average 6 dice roll results in 0 points just 2.3% of the time and an average of points if not zero. Incorporating this into our expectation, the average number of points on the next two rolls is actually . And this is still conservative. You have the option to roll a third time after that second roll, when prudent, which will increase the expected number of points further. Though it still doesn't make sense to roll that last die, it's much closer than it appeared at face value.

In summary, at any point in time, we are making a decision about whether to continue rolling the dice or stop. The key parameters are (1) how many points we have in the bank and (2) how many dice are remaining. A good decision will be based not just on what might happen this roll but also on subsequent rolls. We are going to make a Shiny app to help us make these decisions.

Building & Deploying Our Shiny App

I started by doing a little up-front work to generate the state space of possible rolls. This computation is not reactive and only needs to be performed once prior to app initialization. Next, I created a recursive play function which determines the optimal strategy (roll or stop), with parameters for how many points have been banked so far and how many dice are remaining. I gave the function a max recursion depth to limit computation time. I figured this is okay since (1) turns with a large number of rolls are quite improbable and thus contribute less to our decision making and (2) players become increasingly less likely to continue rolling as they accumulate more points, since they have more to lose. Finally, I made the Shiny app. Building Shiny apps involves laying out inputs, outputs, and the logic that ties them together. This app is very simple. Just 26 lines of R!

On GCP, one option for deploying our Shiny app is spinning up a Compute Instance, installing all the necessary software (R, Shiny server, and other necessary R packages), and downloading the code for our app. A more organized approach is to instead use a container like Docker. Fundamentally, containers allow developers to separate applications from the environments in which they run. Containers package an application and its dependencies into a single manifest (that can be version controlled) that runs directly on top of an OS kernel.

First, we need to set up a Dockerfile for our Docker image. I extended this image, which installs R and Shiny Server. After that, I simply copy my app into the correct directory and do some initialization.

# start with image with R and Shiny server installed
FROM rocker/shiny

# copy files into correct directories
COPY ./shiny/ /srv/shiny-server/farkle/
RUN mv /srv/shiny-server/farkle/shiny-server.conf /etc/shiny-server/shiny-server.conf

# initialize some inputs for the app
WORKDIR /srv/shiny-server/farkle/
RUN mkdir -p data && \
  R -e "install.packages(c('dplyr'), repos='')" && \
  Rscript init.R

Next, a few commands to make our Docker image and verify that it works:

docker build -t farkle:latest .
docker run --rm -p 3838:3838 farkle:latest # test that it works locally

Then tag the image and push it to Google Container Registry:

export PROJECT_ID=$(gcloud config get-value project -q)
docker tag farkle:latest${PROJECT_ID}/shiny-farkle:latest
gcloud docker -- push${PROJECT_ID}/shiny-farkle
gcloud container images list-tags${PROJECT_ID}/shiny-farkle

Finally, we're going to deploy this image on Google Kubernetes Engine. I used Terraform to define and create the GCP infrastructure components for this project: a Kubernetes cluster and a global static IP. We then apply a Kubernetes manifest containing a deployment for our image and a service, connected to the static IP, to make the app externally accessible. I packaged my Kubernetes resources in a Helm chart, which lets you inject values / variables via template directives (e.g. {{ ... }}).

terraform apply -var project=${PROJECT_ID}

gcloud container clusters get-credentials shiny-cluster
gcloud config set container/cluster shiny-cluster

helm init
helm install . --set projectId=${PROJECT_ID}

Terraform configuration:

variable "project" {}

variable "region" {
  default = "us-central1"

variable "zone" {
  default = "us-central1-f"

provider "google" {
  version = "~> 1.4"
  project = "${var.project}"
  region  = "${var.region}"

resource "google_compute_global_address" "shiny-static-ip" {
  name = "shiny-static-ip"

resource "google_container_cluster" "shiny-cluster" {
  name               = "shiny-cluster"
  zone               = "${}"
  initial_node_count = "1"
  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [""]

Kubernetes manifest:

apiVersion: v1
kind: Service
  name: shiny-farkle-service
  annotations: shiny-static-ip
    app: shiny-farkle
  type: LoadBalancer
  - port: 80
    targetPort: 3838
    app: shiny-farkle
apiVersion: extensions/v1beta1
kind: Deployment
  name: shiny-farkle-deploy
    app: shiny-farkle
  replicas: 1
        app: shiny-farkle
      - name: master
        imagePullPolicy: Always
        image:{{ .Values.projectId }}/shiny-farkle:latest
        - containerPort: 3838

Some Final Farkle Insights

The really important question is this: if I have X dice remaining, how many points must I have in the bank to justify NOT rolling? Using our Shiny app (with a 3-roll look-ahead), I estimated these numbers:

Dice Remaining | Bank Threshold to Stop Rolling

The game is played to 10,000 points (with the lagging player getting a rebuttal opportunity). So there is virtually no scenario in which you would not roll 6 dice when given the opportunity! You can find a link to all of my work here and a link to the app deployed using this methodology here. Cheers!

Note #1: A big simplification that I make on game play is that all dice that can be scored will be scored. In reality, players have the option not to score all dice. For instance, if I roll three 1s, I can choose to bank one 1 and roll the 5 remaining dice, which, using our app, makes sense. 100 in the bank and 5 remaining dice has a 352.0 expectation; 200 in the bank and 4 dice remaining has a 329.0 expectation.

Note #2: Of course, these estimates are agnostic to the game situation. In reality, you're trying to maximize your probability of winning, not your expected number of points. If you are down by 5000 points, you're going to need to be a lot more aggressive.


02-10-2018 Update: I moved hosting of this app to Now, purely for financial reasons. Now provides serverless deployments for Node.js and Docker, and includes 3 instances for free!