DonaldRauscher.com

A Blog About D4T4 & M47H

Building Pipelines in K8s with Brigade

14 July ’18

Kubernetes started as a deployment option for stateless services. However, people are increasingly using Kubernetes clusters to execute complex workflows for CI/CD, ETL, machine learning, etc., and a number of tools/projects have sprung up to help orchestrate these workflows. Two that I have been exploring are Argo (from Applatix) and Brigade (from Deis, now part of Microsoft, the same folks who developed the popular K8s package manager Helm).

The container is, of course, at the center of both of these frameworks. Each step in the pipeline is a job that is executed by a Docker container. The major difference between Argo and Brigade is how pipelines are specified: in Argo, pipelines are declared with YAML; in Brigade, they are scripted with JavaScript. Co-creator Matt Butcher provides a great explanation of why they chose this approach. I found this idea really interesting, so I chose to take Brigade for a spin.

I built a simple pipeline which loads cryptocurrency prices from CoinAPI into Google BigQuery. I also used MailGun to send notifications when pipelines complete/fail.

1. Creating Container for Pipeline Steps

First, we need a Docker image that will execute the pipeline steps. In my simple use case, I was able to use a single image, but you could just as easily use a different image for each step. I used a Google Cloud Container Builder image as my base image; it contains the gcloud, kubectl, gsutil, and bq utilities. To that, I added a tool called jq, which I used to convert JSON into newline-delimited JSON for the BigQuery import.

FROM gcr.io/cloud-builders/gcloud:latest

# install wget (needed to fetch the jq binary below)
RUN apt-get update \
  && apt-get install -y wget \
  && rm -rf /var/lib/apt/lists/*

# install jq, used to convert JSON into newline-delimited JSON
RUN wget -O /usr/local/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.5/jq-linux64 \
  && chmod +x /usr/local/bin/jq

ENTRYPOINT ["bash"]

2. Creating Brigade Project

Next, I needed to create a Brigade project, which serves as an execution context for our pipeline. Brigade projects can be easily created with the brigade-project Helm chart. The project points to a Git repo, which should contain the brigade.js script for our pipeline, and it also holds secrets that can be referenced throughout the pipeline.

# values.yaml
project: donald/crypto
namespace: brigade
repository: github.com/donaldrauscher/brigade-crypto
cloneURL: https://github.com/donaldrauscher/brigade-crypto.git

# secrets.yaml
secrets:
  projectId: ...
  coinAPIKey: ...
  mailgunAPIKey: ...

helm install brigade/brigade-project -f values.yaml,secrets.yaml --namespace brigade

3. Creating the Brigade Pipeline

Now we need to write the JavaScript script that defines our pipeline. We can pass any script to Brigade at runtime, but this is discouraged; the script should ideally live in the Git repo that is referenced in the Brigade project. There is some good documentation on how to write Brigade scripts.

// brigade.js
const { events, Job } = require("brigadier")

// image used for all pipeline jobs
function makeImg(p) {
  return "gcr.io/" + p.secrets.projectId + "/brigade-crypto:latest"
}

// builds a curl command that sends a Mailgun notification for the build
function mailgunCmd(e, p) {
  var key = p.secrets.mailgunAPIKey

  if (e.cause.trigger == 'success'){
    var msg = "Build " + e.cause.event.buildID + " ran successfully"
  } else {
    var msg = e.cause.reason
  }

  return `
    curl -s --user "api:${key}" https://api.mailgun.net/v3/mg.donaldrauscher.com/messages \
    -F from="mg@donaldrauscher.com" \
    -F to="donald.rauscher@gmail.com" \
    -F subject="Brigade Notification" \
    -F text="${msg}"
  `
}

events.on("exec", (e, p) => {
  var j1 = new Job("j1", makeImg(p))

  j1.storage.enabled = false

  j1.env = {
    "COIN_API_KEY": p.secrets.coinAPIKey,
    "TIMESTAMP": e.payload.trim()
  }

  j1.tasks = [
    "export TIMESTAMP=${TIMESTAMP:-$(date '+%Y-%m-%dT%H:%M')}",
    "curl https://rest.coinapi.io/v1/quotes/current?filter_symbol_id=_SPOT_ --request GET --header \"X-CoinAPI-Key: $COIN_API_KEY\" --fail -o quotes.json",
    "jq --compact-output '.[]' quotes.json > quotes.ndjson",
    "gsutil cp quotes.ndjson gs://djr-data/crypto/$TIMESTAMP/quotes.ndjson",
    "bq load --replace --source_format=NEWLINE_DELIMITED_JSON crypto.quotes gs://djr-data/crypto/$TIMESTAMP/quotes.ndjson"
  ]

  j1.run()
})

events.on("after", (e, p) => {
  var a1 = new Job("a1", makeImg(p))
  var cmd = mailgunCmd(e, p)
  a1.storage.enabled = false
  a1.tasks = [cmd]
  a1.run()
})

events.on("error", (e, p) => {
  var e1 = new Job("e1", makeImg(p))
  var cmd = mailgunCmd(e, p)
  e1.storage.enabled = false
  e1.tasks = [cmd]
  e1.run()
})

4. Testing Pipeline

Finally, we can test our pipeline. To manually trigger builds and check their status, you will need the brig command-line tool, which you can download from one of the Brigade releases.

# manually trigger a build using the local brigade.js
brig run donald/crypto -f brigade.js -n brigade

# look up the project ID and the most recent build ID
export BRIG_PROJECT_ID=$(brig project list -n brigade | grep "donald/crypto" | head -1 | awk '{ print $2 }')
export BRIG_BUILD_ID=$(brig build list -n brigade | grep "$BRIG_PROJECT_ID" | tail -1 | awk '{ print $1 }')

# view the build logs and the logs from the j1 job's pod
brig build logs $BRIG_BUILD_ID -n brigade
kubectl logs j1-$BRIG_BUILD_ID -n brigade

I also adapted this example from Matt Butcher to create a cronjob to kick off this pipeline periodically. My main revision was to insert the timestamp into the event payload using a K8s init container.


Overall, I am really impressed with Brigade, and I'm really excited to use it more. You can find a link to all of my work here. Cheers!

Topic Modeling Fake News

26 June ’18

I decided to change things up a little bit and take on an unsupervised learning task: topic modeling. For this, I explored an endlessly entertaining dataset: a database of fake news articles compiled by Kaggle. It comprises ~13K articles from 200 different sources circa Oct '16 - Nov '16 (a period which coincided with the 2016 US Presidential Election).

First, I set up a pipeline to express each article as a bag-of-words. I included a few preprocessing steps to clean the data (lowercase, remove URLs, remove HTML tags, and detect common phrases). I also used TF-IDF (term frequency - inverse document frequency) to assign higher weights to more "important" words. For the topic model itself, I used non-negative matrix factorization (NMF). Other popular approaches include probabilistic models like LDA.

[Pipeline diagram: Corpus of Raw Documents → Lowercase → Remove links → Remove HTML tags → Tokenize → Phraser → Filter Stop Words → Min/Max Term Filtering + Vectorization → TF-IDF → Document Term Matrix → NMF]
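
The sketch below shows roughly how this pipeline could be wired up in Python with gensim and scikit-learn. It is a minimal illustration, not my original code: the `docs` placeholder and the parameter values (phrase thresholds, min/max document frequencies, random_state) are assumptions, with n_components=40 matching the final model described below.

import re
from gensim.models.phrases import Phrases, Phraser
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(doc):
    # lowercase, strip URLs and HTML tags, then tokenize
    doc = doc.lower()
    doc = re.sub(r"https?://\S+", " ", doc)
    doc = re.sub(r"<[^>]+>", " ", doc)
    return re.findall(r"[a-z]+", doc)

docs = ["..."]  # placeholder for the ~13K raw articles
tokenized = [preprocess(d) for d in docs]

# detect common two-word phrases and merge them into single tokens
phraser = Phraser(Phrases(tokenized, min_count=5, threshold=10))
texts = [" ".join(phraser[t]) for t in tokenized]

# stop word removal, min/max term filtering, vectorization, and TF-IDF weighting
vectorizer = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.5)
dtm = vectorizer.fit_transform(texts)

# factor the document-term matrix into document-topic and topic-term matrices
nmf = NMF(n_components=40, random_state=42)
doc_topics = nmf.fit_transform(dtm)

# top 10 keywords per topic (use get_feature_names() on older scikit-learn)
terms = vectorizer.get_feature_names_out()
topics = [[terms[i] for i in comp.argsort()[::-1][:10]] for comp in nmf.components_]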

Whether using NMF or LDA, determining the number of topics to model is always a challenge. If the topics are too big, they will be broad and unspecific; if they are too small, they will be overly narrow and descriptive. We can build models with different numbers of topics, but how do we evaluate their performance? Two options (both metrics are sketched in code after the citation below):

  1. Topic Cohesiveness - The extent to which a topic’s top keywords are semantically related in our overall corpus. We can use word embeddings created with Word2Vec to measure the similarity between keywords. Higher = better.
  2. Topic Generality - The extent to which each topic's top keywords overlap with other topics' top keywords. We can use mean pairwise Jaccard similarity between topics for this. Lower = better.

Source: D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An Analysis of the Coherence of Descriptors in Topic Modeling,” Expert Systems with Applications (ESWA), 2015. [link]
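
To make the two measures concrete, here is a minimal sketch of how they could be computed, reusing the `tokenized` token lists and the `topics` keyword lists from the sketch above; the Word2Vec parameters are again assumptions rather than my original settings.

from itertools import combinations
import numpy as np
from gensim.models import Word2Vec

# word embeddings trained on the same corpus
w2v = Word2Vec(tokenized, min_count=5)

def topic_cohesiveness(topics, w2v):
    # mean pairwise Word2Vec similarity of each topic's top keywords (higher = better)
    scores = []
    for keywords in topics:
        keywords = [k for k in keywords if k in w2v.wv]
        pairs = list(combinations(keywords, 2))
        if pairs:
            scores.append(np.mean([w2v.wv.similarity(a, b) for a, b in pairs]))
    return np.mean(scores)

def topic_generality(topics):
    # mean pairwise Jaccard similarity between topics' top-keyword sets (lower = better)
    overlaps = [len(a & b) / len(a | b)
                for a, b in combinations([set(t) for t in topics], 2)]
    return np.mean(overlaps)

print(topic_cohesiveness(topics, w2v), topic_generality(topics))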

I selected a number of topics that balanced these two measures (n=40). The chart below summarizes the frequency and top keywords for each topic in the final model. Overall, I found the topics to be highly cohesive and relevant!

Note: I found that my solutions were highly sensitive to NMF regularization. When using L1 regularization, I got one very large topic and a long tail of small topics. I had a similar problem, to a lesser degree, when using L2 regularization. In the end, I chose to use no regularization because I was okay with articles belonging to multiple topics.
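
For reference, this is roughly how the three regularization settings could be expressed with scikit-learn's NMF; the alpha value and random_state are illustrative rather than my original settings, and newer scikit-learn releases replace alpha/l1_ratio with alpha_W/alpha_H.

from sklearn.decomposition import NMF

# no regularization (the final choice)
nmf_none = NMF(n_components=40, random_state=42)

# L1 penalty: produced one very large topic plus a long tail of small topics
nmf_l1 = NMF(n_components=40, alpha=0.1, l1_ratio=1.0, random_state=42)

# L2 penalty: a similar, though less severe, imbalance
nmf_l2 = NMF(n_components=40, alpha=0.1, l1_ratio=0.0, random_state=42)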

Link to my GH repo here. You can also play around with the model itself on Binder, a free-to-use (!) deployment of BinderHub.


Moving My Blog from Jekyll/GitHub to Pelican/GCS

28 May ’18

I recently moved this blog from Jekyll/GitHub to Pelican/GCS. Mainly, I wanted to move to a Python-based framework where I would have more flexibility to customize (e.g. add/create plugins). Cost isn't really a consideration, as both options are free: GitHub Pages is actually powered by Jekyll and gives you free hosting on GitHub's github.io domain, while the fully rendered blog is well under 1 GB and you get 5 GB of GCS storage for free.

I used Google Container Builder for CI/CD to publish my website on a GCS bucket, drawing upon a lot of ideas from this tutorial. The build has the following steps:

  1. Pull from GitHub - This performs a shallow git clone of my repo on GitHub.
  2. Generate site with Pelican - I needed to create a custom Google Container Builder step for this. See the Dockerfile below. In addition to installing Pelican and some other Python dependencies, this container also installs Sass for compiling CSS.
  3. Push to GCS Bucket - This uploads the HTML generated by Pelican to a GCS bucket using gsutil rsync.

I needed to update a few settings on the GCS bucket to serve the site, namely setting the index page (MainPageSuffix) and making the site globally readable. Finally, I set up a build trigger so that whenever I push to my master branch, a build kicks off automatically.

Overall, I'm loving Pelican so far. You can find my blog's new repo here. Cheers!

cloudbuild.yaml

steps:
  # 1. shallow clone of the GitHub repo
  - name: gcr.io/cloud-builders/git
    args: ['clone', '-b', '${_BRANCH}', '--single-branch', '--depth', '1', 'https://github.com/donaldrauscher/blog-pelican.git']
  # 2. generate the site with Pelican
  - name: gcr.io/${PROJECT_ID}/pelican:latest
    args: ["content", "-v"]
    dir: blog-pelican
  # 3. push the generated HTML to the GCS bucket with gsutil rsync
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: gsutil
    args: ["-m", "rsync", "-r", "-c", "-d", "./output", "gs://${_SUB_DOMAIN}.donaldrauscher.com"]
    dir: blog-pelican
substitutions:
  _BRANCH: master
  _SUB_DOMAIN: www

Dockerfile for Pelican GCB step

FROM gcr.io/cloud-builders/gcloud

ENV SASS_VERSION 1.3.2
ENV PATH /builder/dart-sass:${PATH}

COPY requirements.txt .

# requirements.txt:
# blinker==1.4        Markdown==2.6.11        pytz==2018.4
# docutils==0.14      MarkupSafe==1.0         six==1.11.0
# feedgenerator==1.9  pelican==3.7.1          Unidecode==1.0.22
# Jinja2==2.10        Pygments==2.2.0         webassets==0.12.1
# jsmin==2.2.2        python-dateutil==2.7.3

# install Pelican and other Python dependencies
RUN pip install --no-cache-dir --upgrade setuptools \
  && pip install --no-cache-dir --upgrade -r requirements.txt

RUN apt-get update \
  && apt-get install -y wget \
  && rm -rf /var/lib/apt/lists/*

# install dart-sass for compiling the site's CSS
RUN wget -q -O /builder/dart-sass.tar.gz https://github.com/sass/dart-sass/releases/download/${SASS_VERSION}/dart-sass-${SASS_VERSION}-linux-x64.tar.gz \
  && tar xvzf /builder/dart-sass.tar.gz --directory=/builder \
  && rm /builder/dart-sass.tar.gz

ENTRYPOINT ["pelican"]