DonaldRauscher.com

A Blog About D4T4 & M47H

Two Options for Hosting a Private PyPI Repository

11 August ’18

A few years back, I read an interesting post about how Airbnb's data science team developed their own internal R package, Rbnb, to standardize solutions to common problems and reduce redundancy across projects. I really like this idea and have implemented a similar solution for Python at places that I have worked. This post details two options for hosting a private Python package, both of which leverage Google Cloud Build for CI/CD.

Option #1 - Gemfury

Gemfury is a cloud package repository that you can use to host both public and private packages for Python (and lots of other languages). There are some useful instructions for uploading Python packages to Gemfury and installing them with pip. The following Cloud Build pipeline will, on tagged commits, clone the package from Google Cloud Source Repositories, run tests, build a source distribution, and curl it to Gemfury:

steps:
  - name: gcr.io/cloud-builders/gcloud
    args: ['source', 'repos', 'clone', '${_PACKAGE}', '--project=${PROJECT_ID}']
  - name: gcr.io/cloud-builders/git
    args: ['checkout', '${TAG_NAME}']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/${PROJECT_ID}/python-packager:latest
    entrypoint: 'bash'
    args: ['-c', 'pip3 install -e . && python3 -m pytest -s']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/${PROJECT_ID}/python-packager:latest
    args: ['setup.py', 'sdist']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/cloud-builders/curl
    entrypoint: 'bash'
    args: ['-c', 'curl -f -F package=@dist/${_PACKAGE}-${TAG_NAME}.tar.gz https://$${FURY_TOKEN}@push.fury.io/${_FURY_USER}/']
    secretEnv: ['FURY_TOKEN']
    dir: '/workspace/${_PACKAGE}'
secrets:
- kmsKeyName: projects/blog-180218/locations/global/keyRings/djr/cryptoKeys/fury
  secretEnv:
    FURY_TOKEN: CiQAUrbjD9VjSHPnmMvLV0Jv+duPGyuaIgS0C2u1LmcVRGHY/BwSPQCP7mNtRVGShanmgHUx5RHoohNDGWX4FnscAmbMBVplms0uOQfHLmLy/wkfaxAHYoK2pX/LKDxDIwQzAz0=
substitutions:
  _PACKAGE: djr-py
  _FURY_USER: donaldrauscher
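
Once the package is pushed, installing it is just a matter of pointing pip at your Gemfury index. A minimal sketch (FURY_DEPLOY_TOKEN is a placeholder for a read-only deploy token; swap in your own user and package):

pip3 install djr-py --extra-index-url https://${FURY_DEPLOY_TOKEN}@pypi.fury.io/donaldrauscher/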

NOTE: You need to create a KMS keyring/key, give Cloud Build access to it, and use that key to encrypt your Fury token (shown below). You can find additional instructions on how to do this here.

echo -n ${FURY_TOKEN} | gcloud kms encrypt --plaintext-file=- --ciphertext-file=- --location=global --keyring=djr --key=fury | base64

Option #2 - GCS Bucket

If you don't care about restricting who can access your package (I clearly do not), then you can host a simple PyPI repository on a GCS bucket using dumb-pypi. First, you will need to set up a GCS bucket where you can host a static site. This Cloud Build pipeline uploads the package to GCS and triggers a second Cloud Build pipeline, which rebuilds the PyPI repository on the specified GCS bucket.

steps:
  - name: gcr.io/cloud-builders/git
    args: ['clone', '-b', '${TAG_NAME}', '--single-branch', '--depth', '1', 'https://github.com/${_GITHUB_USER}/${_PACKAGE}.git']
  - name: gcr.io/${PROJECT_ID}/python-packager:latest
    entrypoint: 'bash'
    args: ['-c', 'pip3 install -e . && python3 -m pytest -s']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/${PROJECT_ID}/python-packager:latest
    args: ['setup.py', 'sdist']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/cloud-builders/gsutil
    args: ['cp', 'dist/${_PACKAGE}-${TAG_NAME}.tar.gz', 'gs://${_BUCKET}/raw/']
    dir: '/workspace/${_PACKAGE}'
  - name: gcr.io/cloud-builders/git
    args: ['clone', 'https://github.com/donaldrauscher/gcs-pypi.git']
  - name: gcr.io/cloud-builders/gcloud
    args: ['builds', 'submit', '--config', 'cloudbuild.yaml', '--no-source', '--async', '--substitutions', '_BUCKET=${_BUCKET}']
    dir: '/workspace/gcs-pypi'
substitutions:
  _PACKAGE: djr-py
  _BUCKET: pypi.donaldrauscher.com
  _GITHUB_USER: donaldrauscher
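
After the repository is rebuilt, installing from it just requires adding the bucket's index to pip. A sketch, assuming dumb-pypi publishes its PEP 503 index under /simple/ (the exact path depends on how dumb-pypi is configured); --trusted-host is needed because the site is served over plain HTTP:

pip3 install djr-py --trusted-host pypi.donaldrauscher.com --extra-index-url http://pypi.donaldrauscher.com/simple/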

===

NOTE: Both of these Cloud Build jobs require a python-packager custom Cloud Build step. This is a simple Docker container with some Python utilities:

FROM gcr.io/cloud-builders/gcloud

RUN apt-get update \
  && apt-get install -y python3-pip \
  && rm -rf /var/lib/apt/lists/*

RUN pip3 install --upgrade pip setuptools wheel pylint pytest

ENTRYPOINT ["python3"]

I used option #2 to host my personal Python package (djr-py) on http://pypi.donaldrauscher.com/. Enjoy!

Building Pipelines in K8s with Brigade

14 July ’18

Kubernetes started as a deployment option for stateless services. However, people are increasingly using Kubernetes clusters to execute complex workflows for CI/CD, ETL, machine learning, etc. And there are a number of tools/projects that have sprung up to help orchestrate these workflows. Two that I have been exploring are Argo (from Applatix) and Brigade (from DEIS, now Microsoft, the same folks who developed the popular K8s package manager Helm).

The container is, of course, at the center of both of these frameworks. Each step in the pipeline is a job that is executed by a Docker container. The major difference between Argo and Brigade is how they specify pipelines. In Argo, pipelines are declared with YAML. In Brigade, pipelines are scripted with JavaScript. Co-creator Matt Butcher provides a great explanation for why they chose this approach. I found this idea really interesting, so I chose to take Brigade for a spin.

I built a simple pipeline which loads cryptocurrency prices from CoinAPI into Google BigQuery. I also used MailGun to send notifications when pipelines complete/fail.

1. Creating a Container for Pipeline Steps

First, we need a Docker container that will execute the pipeline steps. In my simple use case, I was able to use a single image, but you could just as easily use a different image for each step. I used a Google Cloud Container Builder image as my base image. This contains the gcloud, kubectl, gsutil, and bq utilities. To that, I added a tool called jq, which I used to convert JSON into newline-delimited JSON for the BigQuery import.

FROM gcr.io/cloud-builders/gcloud:latest

RUN apt-get update \
  && apt-get install -y wget \
  && rm -rf /var/lib/apt/lists/*

RUN wget -O /usr/local/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.5/jq-linux64 \
  && chmod +x /usr/local/bin/jq

ENTRYPOINT ["bash"]

2. Creating Brigade Project

Next, I needed to create a Brigade project. The Brigade project serves as an execution context for our pipeline. Brigade projects can be easily created with the brigade-project Helm chart. The project contains a link to a Git repo, which should contain a brigade.js script for our pipeline. It also contains secrets that can be referenced throughout our pipeline.

# values.yaml
project: donald/crypto
namespace: brigade
repository: github.com/donaldrauscher/brigade-crypto
cloneURL: https://github.com/donaldrauscher/brigade-crypto.git

# secrets.yaml
secrets:
  projectId: ...
  coinAPIKey: ...
  mailgunAPIKey: ...

helm install brigade/brigade-project -f values.yaml,secrets.yaml --namespace brigade

3. Creating the Brigade Pipeline

Now we need to set up a JavaScript script that defines our pipeline. We can pass any script to Brigade at runtime, but this is discouraged; the script should ideally live in the Git repo referenced in the Brigade project. There is some good documentation on how to write Brigade scripts.

// brigade.js
const { events, Job } = require("brigadier")

function makeImg(p) {
  return "gcr.io/" + p.secrets.projectId + "/brigade-crypto:latest"
}

function mailgunCmd(e, p) {
  var key = p.secrets.mailgunAPIKey

  if (e.cause.trigger == 'success'){
    var msg = "Build " + e.cause.event.buildID + " ran successfully"
  } else {
    var msg = e.cause.reason
  }

  return `
    curl -s --user "api:${key}" https://api.mailgun.net/v3/mg.donaldrauscher.com/messages \
    -F from="mg@donaldrauscher.com" \
    -F to="donald.rauscher@gmail.com" \
    -F subject="Brigade Notification" \
    -F text="${msg}"
  `
}

events.on("exec", (e, p) => {
  var j1 = new Job("j1", makeImg(p))

  j1.storage.enabled = false

  j1.env = {
    "COIN_API_KEY": p.secrets.coinAPIKey,
    "TIMESTAMP": e.payload.trim()
  }

  j1.tasks = [
    "export TIMESTAMP=${TIMESTAMP:-$(date '+%Y-%m-%dT%H:%M')}",
    "curl https://rest.coinapi.io/v1/quotes/current?filter_symbol_id=_SPOT_ --request GET --header \"X-CoinAPI-Key: $COIN_API_KEY\" --fail -o quotes.json",
    "jq --compact-output '.[]' quotes.json > quotes.ndjson",
    "gsutil cp quotes.ndjson gs://djr-data/crypto/$TIMESTAMP/quotes.ndjson",
    "bq load --replace --source_format=NEWLINE_DELIMITED_JSON crypto.quotes gs://djr-data/crypto/$TIMESTAMP/quotes.ndjson"
  ]

  j1.run()
})

events.on("after", (e, p) => {
  var a1 = new Job("a1", makeImg(p))
  var cmd = mailgunCmd(e, p)
  a1.storage.enabled = false
  a1.tasks = [cmd]
  a1.run()
})

events.on("error", (e, p) => {
  var e1 = new Job("e1", makeImg(p))
  var cmd = mailgunCmd(e, p)
  e1.storage.enabled = false
  e1.tasks = [cmd]
  e1.run()
})

4. Testing Pipeline

Finally, we can test our pipeline. To manually trigger builds and check the status of builds, you will need the brig command line tool. You can download this from one of the Brigade releases.

brig run donald/crypto -f brigade.js -n brigade
export BRIG_PROJECT_ID=$(brig project list -n brigade | grep "donald/crypto" | head -1 | awk '{ print $2 }')
export BRIG_BUILD_ID=$(brig build list -n brigade | grep "$BRIG_PROJECT_ID" | tail -1 | awk '{ print $1 }')
brig build logs $BRIG_BUILD_ID -n brigade
kubectl logs j1-$BRIG_BUILD_ID -n brigade

I also adapted this example from Matt Butcher to create a cronjob to kick off this pipeline periodically. My main revision was to insert the timestamp into the event payload using a K8s init container.

===

Overall, I am really impressed with Brigade, and I'm really excited to use it more. You can find a link to all of my work here. Cheers!

Topic Modeling Fake News

26 June ’18

I decided to change things up a little bit and take on an unsupervised learning task: topic modeling. For this, I explored an endlessly entertaining dataset, a database of fake news articles compiled by Kaggle. It comprises ~13K articles from 200 different sources circa Oct '16 - Nov '16 (a period which coincided with the 2016 US Presidential Election).

First, I set up a pipeline to express each article as a bag-of-words. I included a few preprocessing steps to clean the data (lowercase, remove URLs, remove HTML tags, and detect common phrases). I also used TF-IDF (term frequency - inverse document frequency) to assign higher weights to more "important" words. For the topic model itself, I used non-negative matrix factorization (NMF). Other popular approaches include probabilistic models like LDA.

[Diagram: corpus of raw documents → lowercase, remove links, remove HTML tags → tokenize → phraser → filter stop words → min/max term filtering + vectorization → TF-IDF → document-term matrix → NMF]
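
For reference, here is a minimal sketch of that pipeline in scikit-learn. This is not the exact code from my repo; the preprocessing regexes, the English stop word list, and the hyperparameters (40 topics, no regularization) are simplifications/assumptions, and the gensim Phraser step is omitted:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def preprocess(doc):
    # lowercase, strip URLs and HTML tags
    doc = doc.lower()
    doc = re.sub(r'https?://\S+', ' ', doc)
    return re.sub(r'<[^>]+>', ' ', doc)

# articles = list of raw document strings from the Kaggle dataset (assumed loaded already)
# min_df/max_df implement the min/max term filtering; stop words are dropped here too
vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words='english',
                             min_df=5, max_df=0.5)
dtm = vectorizer.fit_transform(articles)

# 40 topics, no regularization (alpha=0)
nmf = NMF(n_components=40, alpha=0, random_state=42)
W = nmf.fit_transform(dtm)   # document-topic weights
H = nmf.components_          # topic-term weights

# top 10 keywords per topic
terms = vectorizer.get_feature_names()
top_keywords = [[terms[j] for j in topic.argsort()[::-1][:10]] for topic in H]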

Whether using NMF or LDA, determining the number of topics to model is always a challenge. If we choose too few topics, they will be too broad and unspecific. If we choose too many, they will be too narrow and overly descriptive. We can build models with different numbers of topics, but how do we evaluate their performance? Two options (sketched in code below):

  1. Topic Cohesiveness - The extent to which a topic’s top keywords are semantically related in our overall corpus. We can use word embeddings created with Word2Vec to measure the similarity between keywords. Higher = better.
  2. Topic Generality - The extent to which each topic's top keywords overlap with other topics' top keywords. We can use mean pairwise Jaccard similarity between topics for this. Lower = better.

Source: D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An Analysis of the Coherence of Descriptors in Topic Modeling,” Expert Systems with Applications (ESWA), 2015. [link]
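
A rough sketch of how these two measures can be computed, reusing the top_keywords list from the sketch above and assuming w2v is a gensim Word2Vec model trained on the same corpus (both are assumptions, not the exact code from the paper or my repo):

from itertools import combinations
import numpy as np

def topic_cohesiveness(keywords, w2v):
    # mean pairwise word2vec similarity between a topic's top keywords (higher = better)
    pairs = [(a, b) for a, b in combinations(keywords, 2) if a in w2v.wv and b in w2v.wv]
    return np.mean([w2v.wv.similarity(a, b) for a, b in pairs]) if pairs else 0.0

def topic_generality(top_keywords):
    # mean pairwise Jaccard similarity between topics' keyword sets (lower = better)
    sims = [len(set(k1) & set(k2)) / len(set(k1) | set(k2))
            for k1, k2 in combinations(top_keywords, 2)]
    return float(np.mean(sims))

cohesiveness = np.mean([topic_cohesiveness(k, w2v) for k in top_keywords])
generality = topic_generality(top_keywords)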

I selected a number of topics that balanced these two measures (n=40). The below chart summarizes the frequency and top keywords for each topic in the final model. Overall, I found the topics to be highly cohesive and relevant!

Note: I found that my solutions were highly sensitive to NMF regularization. When using L1 regularization, I had one very large topic and a long tail of small topics; I had a similar problem, to a lesser degree, with L2 regularization. In the end, I chose to use no regularization because I was okay with articles belonging to multiple topics.

Link to my GH repo here. You can also play around with the model itself on Binder, a free-to-use (!) deployment of BinderHub.
