# DonaldRauscher.com

## A Blog About D4T4 & M47H

I used a Bayesian approach to encode high cardinality categorical variables in a Kaggle competition a few months back. My original implementation was in R. However, I have recently been doing most of my modeling in sklearn, so I decided to implement this approach there as well.

This approach lends itself to the sklearn framework very well! The fit method uses MLE to estimate a and b for the prior distribution and calculates descriptive stats for each level. The transform method calculates posterior probabilities on what is assumed to be out-of-sample data. And the fit_transform method runs fit, then calculates posterior probabilities leaving out the current sample, optionally applying some noise (which deters overfitting in tree-based models).
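As a concrete sketch of the posterior calculation (all numbers here are hypothetical): with a Beta(a, b) prior, a level observed n times with k positives gets encoded as the posterior mean, and the leave-one-out variant first subtracts the current sample:

```python
# hypothetical prior hyperparameters and per-level counts
a, b = 2.0, 8.0     # prior Beta(a, b); the fit method estimates these by MLE
k, n = 5, 20        # positives and observations for one level

# transform: posterior mean, used for out-of-sample data
posterior_mean = (k + a) / (n + a + b)       # (5 + 2) / (20 + 10) ≈ 0.233

# fit_transform: leave-one-out posterior mean for a sample with y = 1
loo = (k + a - 1) / (n + a + b - 1)          # (5 + 2 - 1) / (20 + 10 - 1) ≈ 0.207
```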

You can see this method implemented in a familiar friend: my hospital readmission model. Cheers!

```python
import pandas as pd
import numpy as np

from scipy.optimize import minimize
from scipy.special import beta

from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted, column_or_1d
from sklearn.utils.multiclass import type_of_target

# beta binomial density function
@np.vectorize
def dbetabinom(a, b, k, n):
    # cap n at 100 (scaling k accordingly) to keep the likelihood numerically stable
    n2 = np.clip(n, 0, 100)
    k2 = round(k * n2 / n)
    return beta(k2 + a, n2 - k2 + b) / beta(a, b)

# beta binomial negative log likelihood
def betabinom_ll(par, arg):
    return np.sum(-np.log(dbetabinom(par[0], par[1], arg[0], arg[1])))

# default params for MLE
mle_param_defaults = dict(method="L-BFGS-B", x0=[1, 1], bounds=[(0.5, 500), (0.5, 500)])

# encodes single high cardinality categorical variable
class SingleHCCEncoder(BaseEstimator):

    def __init__(self, add_noise=True, noise_sd=0.05, mle_params=mle_param_defaults):
        self.add_noise = add_noise
        self.noise_sd = noise_sd
        self.mle_params = mle_params
        self.a, self.b = None, None
        self.df, self.df_dict = None, None

    # calibrate a and b of beta distribution
    def fit_beta(self):
        check_is_fitted(self, 'df')
        k, n = self.df.k, self.df.n
        mle = minimize(fun=betabinom_ll, args=[k, n], **self.mle_params)
        self.a, self.b = mle.x

    # descriptive stats for each level
    def fit_df(self, x, y):
        df = pd.DataFrame(data=dict(x=x, y=y))
        df = df.groupby(['x']).agg(['sum', 'count', 'mean'])
        df.columns = ['k', 'n', 'p']
        self.df = df
        self.df_dict = df.to_dict(orient="index")

    # NOTE: np.vectorize at class level is not a bound method, so self is passed explicitly
    @np.vectorize
    def transform_one_loo(self, x, y):
        xval = self.df_dict.get(x, dict(k=0, n=0))
        return (xval['k'] + self.a - y) * 1.0 / (xval['n'] + self.a + self.b - 1)

    @np.vectorize
    def transform_one(self, x):
        xval = self.df_dict.get(x, dict(k=0, n=0))
        return (xval['k'] + self.a) * 1.0 / (xval['n'] + self.a + self.b)

    def fit(self, x, y):
        assert type_of_target(y) == "binary"
        x = column_or_1d(x)
        y = column_or_1d(y)
        self.fit_df(x, y)
        self.fit_beta()
        return self

    def fit_transform(self, x, y):
        self.fit(x, y)
        if self.add_noise:
            noise = self.noise_sd * np.random.randn(len(x)) + 1
        else:
            noise = np.repeat(1, len(x))
        return self.transform_one_loo(self, x, y) * noise

    def transform(self, x):
        check_is_fitted(self, 'df_dict')
        x = column_or_1d(x)
        return self.transform_one(self, x)

# encodes multiple high cardinality categorical variables
class HCCEncoder(BaseEstimator):

    def __init__(self, columns, hcc_encode_params={}, seed=1):
        self.columns = columns
        self.hcc_encode_params = hcc_encode_params
        self.seed = seed
        self.labellers = {}

    def fit(self, df, y):
        for c in self.columns:
            hcc_encode_params = self.hcc_encode_params.get(c, {})
            labeller = SingleHCCEncoder(**hcc_encode_params)
            labeller.fit(df[c], y)
            self.labellers[c] = labeller
        return self

    def fit_transform(self, df, y):
        np.random.seed(self.seed)
        df = df.copy()
        for c in self.columns:
            hcc_encode_params = self.hcc_encode_params.get(c, {})
            labeller = SingleHCCEncoder(**hcc_encode_params)
            df[c] = labeller.fit_transform(df[c], y)
            self.labellers[c] = labeller
        return df

    def transform(self, df):
        df = df.copy()
        for c in self.columns:
            df[c] = self.labellers[c].transform(df[c])
        return df
```
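Here is a quick standalone sketch of what the encoder computes, on hypothetical data and with a and b fixed at 1 (the class above fits them by MLE instead):

```python
import pandas as pd

# tiny hypothetical dataset: one categorical column and a binary target
df = pd.DataFrame({'city': ['NYC', 'NYC', 'NYC', 'CHI', 'CHI', 'LA'],
                   'y':    [1,     0,     1,     0,     0,     1]})

# per-level counts: k = positives, n = observations (the encoder's fit_df step)
stats = df.groupby('city')['y'].agg(k='sum', n='count')

# posterior means with fixed prior hyperparameters (normally fit by MLE)
a, b = 1.0, 1.0
encoded = (stats['k'] + a) / (stats['n'] + a + b)
# NYC: (2 + 1) / (3 + 2) = 0.6
```

Rare levels like LA get pulled strongly toward the prior mean, while well-observed levels stay close to their empirical rate.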


Stacking, also called meta ensembling, is a technique used to boost predictive accuracy by blending the predictions of multiple models. This technique is most effective when you have multiple, well-performing models which are not overly similar. Participants in Kaggle competitions will observe that winning solutions are often blends of multiple models, sometimes even models available in public notebooks! Kaggle grandmaster Triskelion has a nice write-up on using stacking in Kaggle competitions. Throwback: the winning solution to the Netflix challenge, from team BellKor's Pragmatic Chaos, used a blend of hundreds of different models.

I recently sought to implement a simple model stack in sklearn. The mlxtend package has a StackingClassifier for this. However, there is one big problem with that class: it does not let you train the meta classifier on out-of-sample predictions from the input models. This is a huge problem! Without out-of-sample predictions, overfit input models will dominate the meta model's weights. I created my own class, leveraging the native FeatureUnion class to house the input models and cross_val_predict to generate out-of-sample predictions. For the meta classifier itself, I applied the logit function to the probabilities from the input models and fed them into a simple logistic regression.

```python
import numpy as np

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

from scipy.special import logit

# method for linking predict_proba to transform
def chop_col0(function):
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)[:, 1:]
    return wrapper

# gives each classifier a transform method that returns its class probabilities
def add_transform(classifiers):
    for key, classifier in classifiers:
        if isinstance(classifier, Pipeline):
            classifier = classifier.steps[-1][-1]
        classifier.transform = chop_col0(classifier.predict_proba)
        classifier.__class__.transform = chop_col0(classifier.__class__.predict_proba)
        # NOTE: need to add to class so clone in cross_val_predict works

# default meta classifier applies logit to probabilities and feeds them into a logistic regression
def default_meta_classifier():
    return Pipeline([
        ('logit', FunctionTransformer(lambda x: logit(np.clip(x, 0.001, 0.999)))),
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression())
    ])

# stacking classifier
class StackingClassifier(Pipeline):

    def __init__(self, classifiers, meta_classifier=None, cv=3):
        add_transform(classifiers)
        self.classifiers = classifiers
        self.meta_classifier = meta_classifier if meta_classifier else default_meta_classifier()
        self.cv = cv
        self.steps = [('stack', FeatureUnion(self.classifiers)), ('meta', self.meta_classifier)]
        self.memory = None

    @staticmethod
    def add_prefix(px, x):
        return {'%s__%s' % (px, k): v for k, v in x.items()}

    def set_params(self, **kwargs):
        # minimal reconstruction: set params, then rebuild the pipeline steps
        super(StackingClassifier, self).set_params(**kwargs)
        self.steps = [('stack', FeatureUnion(self.classifiers)), ('meta', self.meta_classifier)]
        return self

    def fit(self, X, y):
        # train meta classifier on out-of-sample predictions from the input models
        meta_features = cross_val_predict(FeatureUnion(self.classifiers), X, y, cv=self.cv, method="transform")
        self.meta_classifier.fit(meta_features, y)
        # then refit each input model on the full training set
        for name, classifier in self.classifiers:
            classifier.fit(X, y)
        return self
```
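To illustrate why the out-of-sample predictions matter, here is a minimal, self-contained sketch of the same pattern using cross_val_predict directly (synthetic data; the model choices are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from scipy.special import logit

X, y = make_classification(n_samples=500, random_state=0)

# out-of-sample probabilities: each sample is predicted by a fold that never saw it
models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=50, random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1] for m in models
])

# meta classifier: logit-transformed probabilities into a logistic regression
meta = LogisticRegression()
meta.fit(logit(np.clip(meta_features, 0.001, 0.999)), y)
```

Had the meta features come from in-sample predictions instead, the random forest's near-perfect training probabilities would dominate the blend.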


You can see this code implemented here. I built a model to predict which recently hospitalized diabetic patients will be re-hospitalized within 30 days, using this dataset from UCI. My model stack contained a logistic regression with regularization, a random forest, and a gradient boosting (xgboost) model. Here is a summary of model performance:

| Model | AUC |
| --- | --- |
| LR+RF+XGB Model Stack | 0.6990824552912449 |
| LR+RF+XGB Average | 0.6981398497127431 |
| XGBoost | 0.6956653497449965 |
| Random Forest | 0.6952079165690574 |
| Logistic Regression | 0.684611003872049 |

As you can see, a simple average of the models outperforms any one model. And our model stack outperforms the simple average.

Another technique that I'd like to explore is feature-weighted linear stacking: tuning a meta model using interactions of meta features and input model predictions, the idea being that we can identify pockets of samples in which certain models perform best. More on this later!

I have used Google's serverless data warehouse service, BigQuery, for several of my projects this past year. I recently started familiarizing myself with Google's serverless data pipeline service, Dataflow. This post shows how to build a pipeline to identify frequently purchased item sets in market basket data from Instacart (3.2M orders, 32M total items purchased, 50K unique items purchased).

## What is Apache Beam?

Apache Beam is a programming model for building and executing data processing pipelines. Pipelines are built using one of Beam's supported SDKs (Java, Python) and executed using one of Beam's supported runners (Spark, Flink, Dataflow on GCP). Pipelines themselves are typically just MapReduce operations: filtering, transforming, grouping + aggregating, etc.

Streaming data processing and batch data processing are often treated as distinctly different things. Streaming = unbounded, low latency, low accuracy. Batch = bounded, high latency, high accuracy. Nathan Marz's popular Lambda Architecture calls for both, a "correct" batch layer with historical data and an "approximate" real-time layer with recent data. However, the merits of this architecture are increasingly being challenged.

Beam is built on the idea that streaming data processing is really a superset of batch data processing. Beam has native features (e.g. windowing, watermarking, triggering, accumulation rules) to handle one of the biggest challenges of streaming data sources, the skew between event time and processing time. A common execution pattern in GCP is to use Pub/Sub to capture events and Dataflow to process those events and ingest into BigQuery, GCS, or back into Pub/Sub.

This post will build a batch pipeline on a bounded data source, and, as such, will not showcase a lot of the great features Beam has for streaming data processing!

## Apriori Algorithm

I wanted to do an analysis to identify association rules, e.g. when purchasing A, also purchase B. I used the Apriori algorithm, which proceeds in two steps: (1) identify frequent item sets with a specified minimum support, and (2) derive association rules with a specified minimum confidence. Here, the support of an item set is the number of orders containing it, and the confidence of a rule A → B is the support of A and B together divided by the support of A.

Apriori looks for increasingly larger item sets composed of items from smaller frequent item sets. This works because of the monotonicity property, which states that if an item set is frequent, then every subset of it is also frequent.
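The two steps can be sketched in plain Python on a toy basket (this shows the algorithm only, not the Beam pipeline):

```python
from itertools import combinations
from collections import Counter

# toy orders (the real pipeline runs over 3.2M Instacart orders)
orders = [
    {'bananas', 'avocado', 'raspberries'},
    {'bananas', 'avocado'},
    {'bananas', 'yogurt'},
    {'avocado', 'raspberries'},
]
min_support = 2

# step 1a: frequent single items
item_counts = Counter(i for order in orders for i in order)
frequent_items = {i for i, n in item_counts.items() if n >= min_support}

# step 1b: frequent pairs, built only from frequent items (monotonicity)
pair_counts = Counter(
    pair
    for order in orders
    for pair in combinations(sorted(order & frequent_items), 2)
)
frequent_pairs = {p for p, n in pair_counts.items() if n >= min_support}

# step 2: association rules A -> B with confidence = support(A ∪ B) / support(A)
rules = {
    (a, b): pair_counts[pair] / item_counts[a]
    for pair in frequent_pairs
    for a, b in (pair, pair[::-1])
}
```

On this toy data, 'yogurt' is pruned in step 1a, so no pair containing it is ever counted; that pruning is what makes Apriori tractable at scale.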

I used 5K orders (~0.1% of total orders) as my minimum support cutoff. For simplicity, I only tested for frequent item sets of sizes 1-3. Dataflow ran the pipeline in ~16 minutes, autoscaling up to 3 nodes for the majority of the job. Here is a picture of my Dataflow pipeline, rotated for convenience:

I identified 1,094 frequent single items, 883 frequent item pairs, and 53 frequent item triples. From that, I derived 45 association rules. Here are the top 10 rules ranked in terms of confidence:

| LHS | RHS | Support | Confidence |
| --- | --- | --- | --- |
| Organic Raspberries, Organic Hass Avocado | Bag of Organic Bananas | 11,409 | 44.2% |
| Non Fat Raspberry Yogurt | Icelandic Style Skyr Blueberry Non-fat Yogurt | 7,224 | 44.1% |
| Organic Large Extra Fancy Fuji Apple, Organic Hass Avocado | Bag of Organic Bananas | 5,804 | 43.5% |
| Apple Honeycrisp Organic, Organic Hass Avocado | Bag of Organic Bananas | 6,650 | 42.9% |