DonaldRauscher.com

A Blog About D4T4 & M47H

Moving My Blog from Jekyll/GitHub to Pelican/GCS

28 May ’18

I recently moved this blog from Jekyll/GitHub to Pelican/GCS. Mainly, I wanted a Python-based framework where I would have more flexibility to customize (e.g. add/create plugins). Cost isn't really a consideration, as both options are free: GitHub Pages is powered by Jekyll and gives you free hosting on GitHub's github.io domain, and the fully-rendered blog is well under 1 GB, comfortably inside the 5 GB of GCS storage you get for free.

I used Google Container Builder for CI/CD to publish my website on a GCS bucket, drawing upon a lot of ideas from this tutorial. The build has the following steps:

  1. Pull from GitHub - This performs a shallow git clone of my repo on GitHub.
  2. Generate site with Pelican - I needed to create a custom Google Container Builder step for this; see the Dockerfile below. In addition to Pelican and some other Python dependencies, this container also installs Sass for compiling the CSS.
  3. Push to GCS Bucket - This uploads the HTML generated by Pelican to a GCS bucket using gsutil rsync.

I needed to update a few settings on the GCS bucket to serve the site, namely pointing directory requests at the index page (MainPageSuffix) and making the bucket's contents publicly readable. Finally, I set up a build trigger: whenever I push to my master branch, it automatically kicks off a build.
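For reference, here's a minimal sketch of those two bucket settings using the google-cloud-storage Python client; this is an assumed equivalent of the gsutil commands, with the bucket name taken from the cloudbuild.yaml below:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('www.donaldrauscher.com')

# serve index.html for directory requests (MainPageSuffix)
bucket.configure_website(main_page_suffix='index.html')
bucket.patch()

# make existing and future objects publicly readable
bucket.make_public(recursive=True, future=True)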

Overall, I'm loving Pelican so far. You can find my blog's new repo here. Cheers!

cloudbuild.yaml

steps:
  - name: gcr.io/cloud-builders/git
    args: ['clone', '-b', '${_BRANCH}', '--single-branch', '--depth', '1', 'https://github.com/donaldrauscher/blog-pelican.git']
  - name: gcr.io/${PROJECT_ID}/pelican:latest
    args: ["content", "-v"]
    dir: blog-pelican
  - name: gcr.io/cloud-builders/gcloud
    entrypoint: gsutil
    args: ["-m", "rsync", "-r", "-c", "-d", "./output", "gs://${_SUB_DOMAIN}.donaldrauscher.com"]
    dir: blog-pelican
substitutions:
  _BRANCH: master
  _SUB_DOMAIN: www

Dockerfile for Pelican GCB step

FROM gcr.io/cloud-builders/gcloud

ENV SASS_VERSION 1.3.2
ENV PATH /builder/dart-sass:${PATH}

COPY requirements.txt .

# requirements.txt:
# blinker==1.4        Markdown==2.6.11        pytz==2018.4
# docutils==0.14      MarkupSafe==1.0         six==1.11.0
# feedgenerator==1.9  pelican==3.7.1          Unidecode==1.0.22
# Jinja2==2.10        Pygments==2.2.0         webassets==0.12.1
# jsmin==2.2.2        python-dateutil==2.7.3

RUN pip install --no-cache-dir --upgrade setuptools \
  && pip install --no-cache-dir --upgrade -r requirements.txt

RUN apt-get update \
  && apt-get install -y wget \
  && rm -rf /var/lib/apt/lists/*

RUN wget -q -O /builder/dart-sass.tar.gz https://github.com/sass/dart-sass/releases/download/${SASS_VERSION}/dart-sass-${SASS_VERSION}-linux-x64.tar.gz \
  && tar xvzf /builder/dart-sass.tar.gz --directory=/builder \
  && rm /builder/dart-sass.tar.gz

ENTRYPOINT ["pelican"]

Test

28 May ’18

Hello World!

Using Word2Vec for "Code Names"

12 May ’18

"Code Names" Rules: People are divided into two teams. The board is comprised of 25 words divided into 4 categories: blue team, red team, neutral, and the death word. People are divided evenly into two teams (red and blue). In each round, two people from either team take turns giving 1 word clues. The goal is to get the other members of your team to guess your teams' words and NOT the other words, especially not the death word; if your team guesses the death word, you immediately lose.

It is a really fun game. I also thought it might be an interesting application for Word2Vec. Word2Vec is a two-layer neural network which models the linguistic contexts of words. There are two approaches to training Word2Vec: CBOW (continuous bag of words) and skip-gram. CBOW predicts a word from a window of surrounding words; skip-gram inverts this, using a single word to predict the words in the surrounding window (see the toy sketch below). This is a nice summary. Also cool: you don't need to train your own Word2Vec model! Lots of people/organizations provide pre-trained word vectors that you can use off the shelf, e.g. Google News and Facebook.
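To make the two training modes concrete, here's a toy gensim sketch (hypothetical mini-corpus; gensim 3.x takes a size argument, renamed vector_size in 4.x):

from gensim.models import Word2Vec

# tiny illustrative corpus; a real model needs far more text
sentences = [
    ['the', 'dog', 'chased', 'the', 'ball'],
    ['the', 'whale', 'swam', 'in', 'the', 'cold', 'ocean'],
    ['zeus', 'threw', 'a', 'snow', 'ball', 'from', 'the', 'mountain'],
]

# sg=0 -> CBOW: predict a word from its surrounding window
cbow = Word2Vec(sentences, size=50, window=3, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the surrounding window from a word
skipgram = Word2Vec(sentences, size=50, window=3, min_count=1, sg=1)

print(skipgram.wv.most_similar('ball', topn=3))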

I built a small app that uses Word2Vec to generate word hints for "Code Names". I used Python's gensim package to measure word similarities / generate hints using pre-trained word vectors from Stanford NLP's GloVe. The app itself is built using Plotly's Dash, which is analogous to Shiny for R. I packaged the entire thing in a Docker container.

Dash app (app.py)

import os

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

import pandas as pd
import numpy as np

from gensim.models import KeyedVectors

import plotly.figure_factory as ff


# initialize app
app = dash.Dash()
server = app.server

# load model
model = 'glove/w2v.{}.txt.gz'.format(os.getenv('GLOVE_MODEL', 'glove.6B.50d'))
word_vectors = KeyedVectors.load_word2vec_format(model, binary=False)

# precompute L2-normalized vectors (saves lots of memory)
word_vectors.init_sims(replace=True)


# pandas df to html
def generate_table(df, max_rows=10):
    return html.Table(
        # header
        [html.Tr([html.Th(col) for col in df.columns])] +

        # body
        [html.Tr([
            html.Td(df.iloc[i][col]) for col in df.columns
        ]) for i in range(min(len(df), max_rows))]
    )


# generate hints for the given words, rendered as a table
def generate_hints(words):
    try:
        hints = word_vectors.most_similar(positive=words)
        hints = pd.DataFrame.from_records(hints, columns=['word','similarity'])
        return generate_table(hints)
    except KeyError as e:
        return html.Div(str(e))


# generate dendrogram for word similarity
def generate_dendro(words):
    try:
        similarities = np.array([word_vectors.distances(w, words) for w in words])
        figure = ff.create_dendrogram(similarities, labels=words)
        figure['layout'].update({'width': 800, 'height': 500})
        return figure
    except KeyError:
        # word not in vocabulary; return an empty figure
        return {'data': []}


# set up app layout
app.layout = html.Div(children=[
    html.H1(children='Code Names Hints'),
    html.Table([
        html.Tr([html.Td("All Words:"), html.Td("Words for Hints:")]),
        html.Tr([html.Td(dcc.Textarea(id='words-all', value='god zeus bat ball mountain cold snow', style={'width': 500})),
                 html.Td(dcc.Input(id='words', value='bat ball', type='text'))]),
        html.Tr([html.Td(dcc.Graph(id='dendro')), html.Td(html.Div(id='hints'))])
    ])
])


# set up app callbacks
@app.callback(
    Output(component_id='dendro', component_property='figure'),
    [Input(component_id='words-all', component_property='value')]
)
def update_dendro(input_value):
    words = [w.lower() for w in input_value.split()]
    return generate_dendro(words)

@app.callback(
    Output(component_id='hints', component_property='children'),
    [Input(component_id='words', component_property='value')]
)
def update_hints(input_value):
    words = [w.lower() for w in input_value.split()]
    return generate_hints(words)


# run
if __name__ == '__main__':
    app.run_server(debug=True)

Dockerfile

FROM python:3.5-slim

ENV PORT 8050
ENV GLOVE_MODEL glove.6B.200d
ENV GUNICORN_WORKERS 3
ENV APP_DIR /app

WORKDIR $APP_DIR

RUN apt-get update \
  && apt-get install -y unzip gzip wget \
  && rm -rf /var/lib/apt/lists/*

COPY requirements.txt app.py entrypoint.sh ./
RUN chmod +x entrypoint.sh

RUN pip install -r requirements.txt

# requirements.txt:
# dash==0.21.1                    gensim==3.4.0
# dash-core-components==0.22.1    pandas==0.22.0
# dash-html-components==0.10.1    gunicorn==19.8.1
# dash-renderer==0.12.1           gevent==1.2.2

RUN wget -q http://nlp.stanford.edu/data/glove.6B.zip \
  && unzip glove.6B.zip -d glove \
  && rm glove.6B.zip \
  && python -m gensim.scripts.glove2word2vec --input glove/${GLOVE_MODEL}.txt --output glove/w2v.${GLOVE_MODEL}.txt \
  && gzip glove/w2v.${GLOVE_MODEL}.txt \
  && rm glove/*.txt

EXPOSE $PORT

ENTRYPOINT $APP_DIR/entrypoint.sh

entrypoint.sh

#!/bin/bash
echo Starting Gunicorn...
gunicorn app:server \
    --name code-names \
    --bind 0.0.0.0:$PORT \
    --workers $GUNICORN_WORKERS \
    --preload \
    --worker-class gevent \
    --timeout 600 \
    --log-level info \
    "$@"

Example output (screenshot of the app's word-similarity dendrogram and hint table):

Overall, it does...okay haha. In some cases, it does surprisingly well. For instance, the app provides "published" as a top hint for "book" and "penguin". However, the algorithm struggles to identify commonalities that may not be explicitly collocated in text. For instance, for "dog" and "whale", "mammal" might be a good hint. However, our app simply lists other animals, e.g. "cat" and "shark".
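Each hint is just a most_similar query against the loaded vectors. For example, the "book" + "penguin" case above boils down to the following (illustrative; results vary with which GloVe model you load):

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('glove/w2v.glove.6B.200d.txt.gz', binary=False)

# "published" appears near the top of the hints for this pair
word_vectors.most_similar(positive=['book', 'penguin'], topn=5)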

I'm hosting a version of the app here on Now.sh, and you can find my repo on GitHub. Cheers!