DonaldRauscher.com

A Blog About D4T4 & M47H

Quick and Easy BI: Setting up Redash on GKE

31 December ’17

Professionally, I have worked quite a lot with BI platforms Looker and Tableau. They are great BI platforms for an organization, though probably too heavy (and too expensive) for a small project or a bootstrapping startup. Sometimes you just need something where you can write queries and dump them into a visualization. Recently, I was looking to implement a lightweight BI tool for a personal project. I chose to use Redash, which you can self-host on your own infrastructure. This post documents how to set up Redash on Google Cloud using GKE. Because I am using CloudSQL as the Postgres backend and a persistent drive for Redis, we can delete our cluster when we're not using it and spin it back up as needed, without losing any data!

Infrastructure Setup

We will use the following Google Cloud components to set up Redash:

  • Postgres DB (via CloudSQL)
  • Persistent disk for Redis instance
  • Kubernetes cluster for Redash Docker image

Here is a Terraform configuration which defines all the necessary infrastructure:

# infrastructure.tf

variable "project" {}

variable "postgres_user" {
  default = "redash"
}
variable "postgres_pw" {
  default = "hsader"
}

variable "region" {
  default = "us-central1"
}

variable "zone" {
  default = "us-central1-f"
}

provider "google" {
  version = "~> 1.4"
  project = "${var.project}"
  region = "${var.region}"
}

resource "google_compute_global_address" "redash-static-ip" {
  name = "redash-static-ip"
}

resource "google_compute_disk" "redash-redis-disk" {
  name  = "redash-redis-disk"
  type  = "pd-ssd"
  size = "200"
  zone  = "${var.zone}"
}

resource "google_sql_database_instance" "redash-db" {
  name = "redash-db"
  database_version = "POSTGRES_9_6"
  region = "${var.region}"
  settings {
    tier = "db-f1-micro"
  }
}

resource "google_sql_database" "redash-schema" {
  name = "redash"
  instance = "${google_sql_database_instance.redash-db.name}"
}

resource "google_sql_user" "proxyuser" {
  name = "${var.postgres_user}"
  password = "${var.postgres_pw}"
  instance = "${google_sql_database_instance.redash-db.name}"
  host = "cloudsqlproxy~%"
}

resource "google_container_cluster" "redash-cluster" {
  name = "redash-cluster"
  zone = "${var.zone}"
  initial_node_count = "1"
  node_config {
    machine_type = "n1-standard-4"
  }
}

Create our infrastructure with Terraform and install Helm Tiller on our Kubernetes cluster. You will also need to create a service account that the CloudSQL proxy on Kubernetes will use. Create that (Role = "Cloud SQL Client"), download the JSON key, and attach key as secret.

export PROJECT_ID=$(gcloud config get-value project -q)
terraform apply -var project=${PROJECT_ID}

gcloud container clusters get-credentials redash-cluster
gcloud config set container/cluster redash-cluster

helm init

kubectl create secret generic cloudsql-instance-credentials \
    --from-file=credentials.json=[PROXY_KEY_FILE_PATH]

Redash Deployment

Next, we need to deploy Redash on our Kubernetes cluster. I packaged my Kubernetes resources in a Helm chart, which you can use to inject values / variables via template directives (e.g. {{ ... }}). I used a Helm hook to set up the configuration and the database resources (CloudSQL proxy + Redis) and also run a job to initialize the Redash schema before deploying the app.

helm install . --set projectId=${PROJECT_ID}

Redash resources:

# config.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "1"
data:
  REDASH_DATABASE_URL: postgresql://{{ .Values.postgres.user }}:{{ .Values.postgres.pw }}@postgres:5432/redash
  REDASH_REDIS_URL: "redis://redis:6379/0"
  PYTHONUNBUFFERED: "0"
  REDASH_LOG_LEVEL: "INFO"
# db.yaml
---
kind: Service
apiVersion: v1
metadata:
  name: postgres
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "2"
spec:
  type: ClusterIP
  selector:
    app: redash
    tier: postgres
  ports:
    - name: postgres
      port: 5432
---
kind: Service
apiVersion: v1
metadata:
  name: redis
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "2"
spec:
  type: ClusterIP
  selector:
    app: redash
    tier: redis
  ports:
    - name: redis
      port: 6379
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: postgres
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "2"
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: redash
        tier: postgres
    spec:
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.11
          command: ["/cloud_sql_proxy", "--dir=/cloudsql",
                    "-instances={{ .Values.projectId }}:us-central1:redash-db=tcp:0.0.0.0:5432",
                    "-credential_file=/secrets/cloudsql/credentials.json"]
          ports:
            - name: postgres
              containerPort: 5432
          volumeMounts:
            - name: cloudsql-instance-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
            - name: ssl-certs
              mountPath: /etc/ssl/certs
            - name: cloudsql
              mountPath: /cloudsql
      volumes:
        - name: cloudsql-instance-credentials
          secret:
            secretName: cloudsql-instance-credentials
        - name: cloudsql
          emptyDir:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: redis
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "2"
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: redash
        tier: redis
    spec:
      containers:
        - name: redis
          image: redis:3.0-alpine
          ports:
            - name: redis
              containerPort: 6379
          volumeMounts:
            - name: redis-disk
              mountPath: /data/redis
      volumes:
        - name: redis-disk
          gcePersistentDisk:
            pdName: redash-redis-disk
            fsType: ext4

Redash DB initialization job:

# init.yaml
---
kind: Job
apiVersion: batch/v1
metadata:
  name: init-db
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "3"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: redash-init
          image: redash/redash:latest
          resources:
            requests:
              memory: 1Gi
          envFrom:
            - configMapRef:
                name: config
          args: ["create_db"]

Redash deployment:

# app.yaml
---
kind: Service
apiVersion: v1
metadata:
  name: redash
  annotations:
    kubernetes.io/ingress.global-static-ip-name: redash-static-ip
spec:
  type: LoadBalancer
  selector:
    app: redash
    tier: app
  ports:
    - port: 80
      targetPort: 5000
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: redash
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: redash
        tier: app
    spec:
      containers:
        - name: server
          image: redash/redash:latest
          resources:
            requests:
              memory: 1Gi
          ports:
            - containerPort: 5000
          envFrom:
            - configMapRef:
                name: config
          env:
            - name: REDASH_COOKIE_SECRET
              value: {{ .Values.cookieSecret }}
          args: ["server"]
        - name: workers
          image: redash/redash:latest
          resources:
            requests:
              memory: 1Gi
          envFrom:
            - configMapRef:
                name: config
          env:
            - name: WORKERS_COUNT
              value: "{{ .Values.numWorkers }}"
            - name: QUEUES
              value: "queries,scheduled_queries,celery"
          args: ["scheduler"]

---

You can find all of my code up on my GitHub here. Cheers!

High Cardinality Categoricals with Sklearn

18 December ’17

I used a Bayesian approach to encode high cardinality categorical variables in a Kaggle a few months back. My original implementation was in R. However, I have recently been doing most of my modeling in sklearn, so I decided to also implement this approach there as well.

This approach lends itself to the sklearn framework very well! The fit method uses MLE to estimate a and b for the prior distribution and calculates descriptive stats for each level. The transform method calculates posterior probabilities on what is assumed to be out-of-sample data. And the fit_transform method runs fit and then calculates posterior probabilities, leaving out the current sample, and applying some noise (deterring overfitting in tree-based models) if specified.

You can see this method implemented in a familiar friend: my hospital readmission model. Cheers!

import pandas as pd
import numpy as np

from scipy.optimize import minimize
from scipy.special import beta

from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted, column_or_1d
from sklearn.utils.multiclass import type_of_target

# beta binomial density function
@np.vectorize
def dbetabinom(a, b, k, n):
    n2 = np.clip(n, 0, 100)
    k2 = round(k * n2 / n)
    return beta(k2 + a, n2 - k2 + b) / beta(a, b)

# beta binomial log loss
def betabinom_ll(par, arg):
    return np.sum(-np.log(dbetabinom(par[0], par[1], arg[0], arg[1])))

# default params for MLE
mle_param_defaults = dict(method = "L-BFGS-B", x0 = [1,1], bounds = [(0.5, 500), (0.5, 500)])

# encodes single high cardinality categorical variable
class SingleHCCEncoder(BaseEstimator):

    def __init__(self, add_noise = True, noise_sd = 0.05, mle_params = mle_param_defaults):
        self.add_noise = add_noise
        self.noise_sd = noise_sd
        self.mle_params = mle_params
        self.a, self.b = None, None
        self.df, self.df_dict = None, None

    # calibrate a and b of beta distribution
    def fit_beta(self):
        check_is_fitted(self, 'df')
        k, n = self.df.k, self.df.n
        mle = minimize(fun = betabinom_ll, args = [k, n], **self.mle_params)
        self.a, self.b = mle.x

    # descriptive stats for each level
    def fit_df(self, x, y):
        df = pd.DataFrame(data = dict(x = x, y = y))
        df = df.groupby(['x']).agg(['sum', 'count', 'mean'])
        df.columns = ['k', 'n', 'p']
        self.df = df
        self.df_dict = df.to_dict(orient = "index")

    @np.vectorize
    def transform_one_loo(self, x, y):
        xval = self.df_dict.get(x, dict(k = 0, n = 0))
        return (xval['k'] + self.a - y) * 1.0 / (xval['n'] + self.a + self.b - 1)

    @np.vectorize
    def transform_one(self, x):
        xval = self.df_dict.get(x, dict(k = 0, n = 0))
        return (xval['k'] + self.a) * 1.0 / (xval['n'] + self.a + self.b)

    def fit(self, x, y):
        assert(type_of_target(y) == "binary")
        x = column_or_1d(x)
        y = column_or_1d(y)
        self.fit_df(x, y)
        self.fit_beta()
        return self

    def fit_transform(self, x, y):
        self.fit(x, y)
        if self.add_noise:
            noise = self.noise_sd * np.random.randn(len(x)) + 1
        else:
            noise = np.repeat(1, len(x))
        return self.transform_one_loo(self, x, y) * noise

    def transform(self, x):
        check_is_fitted(self, 'df_dict')
        x = column_or_1d(x)
        return self.transform_one(self, x)

# encodes multiple high cardinality categorical variables
class HCCEncoder(BaseEstimator):

    def __init__(self, columns, hcc_encode_params = {}, seed = 1):
        self.columns = columns
        self.hcc_encode_params = hcc_encode_params
        self.seed = seed
        self.labellers = {}

    def fit(self, df, y):
        for c in self.columns:
            hcc_encode_params = self.hcc_encode_params.get(c, {})
            labeller = SingleHCCEncoder(**hcc_encode_params)
            labeller.fit(df[c], y)
            self.labellers[c] = labeller
        return self

    def fit_transform(self, df, y):
        np.random.seed(self.seed)
        df = df.copy()
        for c in self.columns:
            hcc_encode_params = self.hcc_encode_params.get(c, {})
            labeller = SingleHCCEncoder(**hcc_encode_params)
            df[c] = labeller.fit_transform(df[c], y)
            self.labellers[c] = labeller
        return df

    def transform(self, df):
        df = df.copy()
        for c in self.columns:
            df[c] = self.labellers[c].transform(df[c])
        return df

Model Stacking with Sklearn

10 December ’17

Stacking, also called meta ensembling, is a technique used to boost predictive accuracy by blending the predictions of multiple models. This technique is most effective when you have multiple, well-performing models which are not overly similar. Participants in Kaggle competitions will observe that winning solutions are often blends of multiple models, sometimes even models available in public notebooks! A nice write-up from Kaggle grand master Triskelion on using stacking in Kaggle competitions. Throw back: the winning solution to the NetFlix challenge, from team BellKor's Pragmatic Chaos, used a blend of hundreds of different models.

Source: https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

I recently sought to implement a simple model stack in sklearn. The mlxtend package has a StackingClassifier for this. However, there was one big problem with this class: it does not allow you to use out-of-sample predictions from input models to train the meta classifier. This is a huge problem! Otherwise, overfitting models will dominate the weights. I created my own class, leveraging the native FeatureUnion class to house the input models and cross_val_predict to generate out-of-sample predictions. For the meta classifier itself, I applied the logit function to the probabilities from the input models and fed them into a simple logistic regression.

import numpy as np

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

from scipy.special import logit


# method for linking `predict_proba` to `transform`
def chop_col0(function):
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)[:,1:]
    return wrapper


def add_transform(classifiers):
    for key, classifier in classifiers:
        if isinstance(classifier, Pipeline):
            classifier = classifier.steps[-1][-1]
        classifier.transform = chop_col0(classifier.predict_proba)
        classifier.__class__.transform = chop_col0(classifier.__class__.predict_proba)
        # NOTE: need to add to class so `clone` in `cross_val_predict` works


# default function applies logit to probabilies and applies logistic regression
def default_meta_classifier():
    return Pipeline([
        ('logit', FunctionTransformer(lambda x: logit(np.clip(x, 0.001, 0.999)))),
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression())
    ])


# stacking classifier
class StackingClassifier(Pipeline):

    def __init__(self, classifiers, meta_classifier=None, cv=3):
        add_transform(classifiers)
        self.classifiers = classifiers
        self.meta_classifier = meta_classifier if meta_classifier else default_meta_classifier()
        self.cv = cv
        self.steps = [('stack', FeatureUnion(self.classifiers)), ('meta', self.meta_classifier)]
        self.memory = None

    @staticmethod
    def add_dict_prefix(x, px):
        return {'%s__%s' % (px, k) : v for k,v in x.items()}

    def set_params(self, **kwargs):
        return super(StackingClassifier, self).set_params(**self.add_dict_prefix(kwargs, 'stack'))

    def fit(self, X, y):
        meta_features = cross_val_predict(FeatureUnion(self.classifiers), X, y, cv=self.cv, method="transform")
        self.meta_classifier.fit(meta_features, y)
        for name, classifier in self.classifiers:
            classifier.fit(X, y)
        return self

You can see this code implemented here. I built a model to predict which recently hospitalized diabetic patients will be re-hospitalized within 30 days, using this dataset from UCI. My model stack contained a logistic regression with regularization, a random forest, and a gradient boosting (xgboost) model. Here is a summary of model performance:

ModelAUC
LR+RF+XGB Model Stack0.6990824552912449
LR+RF+XGB Average0.6981398497127431
XGBoost0.6956653497449965
Random Forest0.6952079165690574
Logistic Regression0.684611003872049

As you can see, a simple average of the models outperforms any one model. And our model stack outperforms the simple average.

Another technique that I'd like to explore is feature-weighted linear stacking: tuning a meta model using interactions of meta features and input model predictions, the idea being that we can identify pockets of samples in which certain models perform best. More on this later!