DonaldRauscher.com

A Blog About D4T4 & M47H

Dealing with High Cardinality Categorical Variables & Other Learnings from the Kaggle Renthop Challenge

20 June ’17

I recently completed the Kaggle Renthop Competition. I had a lot of fun with it. One of my biggest takeaways from the competition was developing a transferable approach for dealing with high cardinality categorical variables like ZIP codes, NAICS industry codes, ICD-10 diagnosis codes, etc. I developed a simple Bayesian approach to encode these variables as probabilities, which can then be included as features in a model. More on this in a bit.

My Kaggle Model

A summary of my final solution:

  • My best solution was a model stack with 2 XGB models and 2 GLMNET models.
  • I rather lazily tuned my XGB models by looking at the parameters in high-performing public notebooks: eta=0.02, min_child_weight=3, max_depth=6, colsample_bytree=0.4, subsample=1, gamma=0, nrounds=2000 (see the sketch after this list). I also implemented early stopping to prevent overfitting. For GLMNET, I used Lasso (alpha=1), a safe choice to limit model complexity. For both models, I used 5-fold cross validation.
  • I blended these models using a simple linear combination, which I tuned on cross-validated, out-of-sample predictions.
  • As already noted, I spent a lot of time dealing with the two high cardinality categorical variables: building_id and manager_id. I included one-hots for high-frequency levels. I also encoded these variables as probabilities, which I fed into the model. More on this later.
  • Rather than use geographic clustering, I used the latitude/longitude variables to place each listing in a specific neighborhood. Neighborhood shapes from Zillow.
  • I used an awesome estimate of price per square foot from fellow contributor Darnal.
  • I extracted n-grams with high odds ratios from the listing descriptions, then used PCA to combine them into a set of uncorrelated, keyword-based features.
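
Here is a minimal sketch of that tuning setup. My actual solution was in R, so this Python xgboost version, the toy data, and the early-stopping patience of 50 rounds are all assumptions:

import numpy as np
import xgboost as xgb

# toy stand-in data; the real features came from the listings
X = np.random.rand(500, 10)
y = np.random.randint(0, 3, 500)  # low / medium / high interest
dtrain = xgb.DMatrix(X, label=y)

# parameters borrowed from high-performing public notebooks
params = {
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",  # the competition metric
    "eta": 0.02,
    "min_child_weight": 3,
    "max_depth": 6,
    "colsample_bytree": 0.4,
    "subsample": 1,
    "gamma": 0
}

# 5-fold CV with early stopping to prevent overfitting
cv = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
            early_stopping_rounds=50, seed=42)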

My score (mlogloss) was 0.53243, which put me at ~650 out of ~2500 (top ~25%). Not a great showing, but I can live with it. Or rather I must live with it because the competition is over (thank goodness).

I think I could have gotten some improvements by (1) doing more feature engineering and/or (2) using a more comprehensive model stack. My favorite solution writeup was from 9th place James Trotman. In it, he details some very clever features that he used (e.g. descriptive stats on corpus of manager descriptions, avg. time between listings for each manager, total photo size and area, number of listing "mistakes") as well as how he constructed his 3-level model stack. Several other top solutions (e.g. 2nd, 3rd) cited using a meta-modeling framework called StackNet.

High Cardinality Categorical Variables

One of the things that I quickly noticed was that we had two challenging high cardinality categorical variables: building_id and manager_id. Across test (N = 74,659) and train (N = 49,352), these two variables had 11,635 and 4,399 unique values respectively. With so many levels, one-hotting is not a viable solution. A more elegant approach is to instead transform the original categorical values into probabilities via a simple Bayesian model:
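
$$ x = c \quad \longrightarrow \quad \hat{P}(y = 1 \,|\, x = c) $$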

The question: how do we estimate these probabilities? A simple approach might be to just take the average interest level for each category (excluding the point itself from the calculation). Famous Kaggler Owen Zhang is a proponent of this approach, which he calls "leave-one-out encoding" here. My main objection to this approach is that there will obviously be a lot of variance for low-frequency categories. Intuitively, it makes more sense to take a weighted average of the within-category average (the evidence) and the overall population average (the prior), with the weights depending on sample size: for categories with more data points, we should lean on the evidence; for categories with fewer data points, we should lean on the prior.
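
To make the leave-one-out approach (and its high variance on small categories) concrete, here is a quick pandas sketch with made-up data; the column names are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "manager_id": ["a", "a", "a", "b", "b", "c"],
    "y": [1, 0, 1, 0, 1, 1]  # 1 = medium/high interest, 0 = low
})

# leave-one-out encoding: mean of y over all *other* rows in the same category
grp = df.groupby("manager_id")["y"]
sums = grp.transform("sum")
counts = grp.transform("count")
df["manager_loo"] = (sums - df["y"]) / (counts - 1)

# manager "a" (3 listings) gets a stable-ish estimate; manager "c" has a
# single listing, so its encoding is 0/0 = NaN; and manager "b"'s rows are
# each encoded entirely by one other observation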

I instead thought of each category as a binomial distribution for which we don't know the probability of success (q) for each trial; placing a beta prior on q gives the beta-binomial distribution. The prior distribution for the probability of success and the posterior distribution given s successes and f failures (success = high or medium interest) are conjugate beta distributions:
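
$$ q \sim \mathrm{Beta}(\alpha, \beta) \qquad \Longrightarrow \qquad q \,|\, s, f \sim \mathrm{Beta}(\alpha + s, \beta + f) $$

The encoded feature is just the posterior mean which, writing n = s + f for the number of trials and k = s for the number of successes, matches the (a + k) / (a + b + n) calculation in the code below:

$$ \hat{q} = \frac{\alpha + s}{\alpha + \beta + s + f} = \frac{\alpha + k}{\alpha + \beta + n} $$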

So simple! And it's doing exactly what we want intuitively: it assigns more weight to the observed data when there are more data points and more weight to the prior when there are fewer data points. Though we still need to estimate our prior distribution hyperparameters, alpha and beta. For this, I just used MLE. To prevent our beta-binomial PMF from returning unmanageably small values, I capped the number of trials at 100. Here is a nifty little function that I wrote for calculating these:

library(dplyr)
library(lazyeval)

# beta-binomial likelihood (dropping the binomial coefficient, which is
# constant with respect to a and b and so irrelevant for MLE); trials
# capped at 100 to keep the PMF from returning unmanageably small values
dbetabinom <- function(k, n, a, b) {
  n2 <- ifelse(n > 100, 100, n)
  k2 <- round(k * n2 / n)
  beta(k2 + a, n2 - k2 + b) / beta(a, b)
}

# negative log-likelihood over all categories
betabinom_ll <- function(k, n, par) {
  sum(-log(dbetabinom(k, n, par[1], par[2])))
}

# fit beta prior hyperparameters (a, b) via MLE
beta_mle <- function(...){
  par <- optim(par = c(1,1), fn = betabinom_ll, method = "L-BFGS-B", lower = c(0.5,0.5), upper = c(500,500), ...)$par
  return(data.frame(a = par[1], b = par[2]))
}

# encode a high-cardinality categorical variable x as a posterior probability
# of y; seg fits a separate prior per segment; loo = 1 applies a leave-one-out
# adjustment on training rows
probabilize_high_card_cat <- function(df, y, x, seg = 1, loo = 1){

  # set x, y, and seg
  df$y <- f_eval(f_capture(y), df)
  df$x <- f_eval(f_capture(x), df)
  df$seg <- f_eval(f_capture(seg), df)

  # determine prior for each segment
  dist <- df %>%
    filter(!is.na(y)) %>% # df includes both test and train
    group_by(seg, x) %>% summarise(k = sum(y), n = n()) %>% ungroup() %>%
    group_by(seg) %>% do(beta_mle(k = .$k, n = .$n)) %>% ungroup()

  # calculate posterior probabilities
  df <- df %>%
    left_join(dist, by = c("seg")) %>%
    group_by(x) %>% mutate(
      k = sum(y, na.rm = TRUE) - loo * ifelse(!is.na(y), y, 0),
      n = sum(!is.na(y), na.rm = TRUE) - loo * as.integer(!is.na(y))
    ) %>% ungroup() %>%
    mutate(y2 = (a + k) / (a + b + n))

  return(df$y2)

}

# example
df$building_interest_h <- probabilize_high_card_cat(df, ifelse(interest_level == "high", 1, 0), building_id, 1, 1)

Overall, great challenge. You can find my entire code base on my GitHub here. Cheers!

A simple Salesforce.com bulk API client for Python 3

15 March ’17

I recently put together my first Python package! My company (AbleTo) recently migrated our CRM to Salesforce.com. Though Salesforce isn't the most elegant interface, many tools (e.g. Marketo) provide off-the-shelf integrations with it, which was very appealing to us. To migrate all of our customer data (several million records), I used Salesforce.com's bulk API. The big difference between working with the REST API and working with the bulk API is that the bulk API is asynchronous: you create a job, add one or more batches to that job, close the job, and wait for it to complete. Once complete, you download the results (in pieces). So a single 'operation' requires several HTTP calls to the bulk API and a rather annoying process of stitching together results. This library provides a few methods to simplify this workflow; inputs and outputs are Pandas dataframes. To avoid reinventing the wheel, I leveraged the SalesforceLogin class from the simple_salesforce package for authentication.
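
To give a sense of what the library is wrapping, here is a rough sketch of a single query 'operation' against the raw bulk API (v1) using nothing but requests; the instance URL, API version, and session_id are placeholders:

import re
import requests

session_id = "00D..."  # placeholder session token (e.g. from SalesforceLogin)
API = "https://na1.salesforce.com/services/async/39.0"  # placeholder instance + version
XML = {"X-SFDC-Session": session_id, "Content-Type": "application/xml"}

# 1. create a query job
job_xml = """<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
  <operation>query</operation>
  <object>Lead</object>
  <contentType>CSV</contentType>
</jobInfo>"""
job = requests.post(API + "/job", headers=XML, data=job_xml)
job_id = re.search("<id>(.*?)</id>", job.text).group(1)

# 2. add the SOQL query as a batch, then close the job
batch = requests.post(API + "/job/" + job_id + "/batch",
                      headers={"X-SFDC-Session": session_id, "Content-Type": "text/csv"},
                      data="SELECT Id, Company FROM Lead")
batch_id = re.search("<id>(.*?)</id>", batch.text).group(1)
close_xml = """<?xml version="1.0" encoding="UTF-8"?>
<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">
  <state>Closed</state>
</jobInfo>"""
requests.post(API + "/job/" + job_id, headers=XML, data=close_xml)

# 3. poll the batch until it completes, then stitch together the result chunks:
#    GET /job/{job_id}/batch/{batch_id}                     -> batch state
#    GET /job/{job_id}/batch/{batch_id}/result              -> result id(s)
#    GET /job/{job_id}/batch/{batch_id}/result/{result_id}  -> one CSV chunk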

Here are the key methods for performing bulk API queries:

  • query: creates a query job and submits a SOQL query as a batch to that job
  • get_all_query_results: downloads results from a query job into a Pandas dataframe

Here are the key methods for performing bulk API jobs:

  • create_job: creates a job (operation types include 'insert', 'upsert', 'update', 'delete', and 'hardDelete')
  • bulk_csv_operation: breaks a Pandas dataframe into chunks and adds each chunk as a batch to the job
  • get_bulk_csv_operation_results: downloads results from a bulk CSV job into a Pandas dataframe

Here is an example of how to use the library:

from sfdc_bulk import SalesforceBulkAPI

# set up our SF object
bulk = SalesforceBulkAPI(**sfdc_credentials)

# pull down some records into a pandas df
query_job = bulk.query('SELECT Id, Company FROM Lead LIMIT 10000')
some_records = bulk.get_all_query_results(query_job)

# make an update
update = some_records[some_records.Company == "Blockbuster Video"].copy()  # .copy() avoids a SettingWithCopyWarning
update['DoNotCall'] = 1

# push to SFDC
update_job = bulk.create_job(object='Lead', operation='update', contentType='CSV')
bulk.bulk_csv_operation(update_job, update)
update_results = bulk.get_bulk_csv_operation_results(update_job)

The package is available on PyPI, so it can be installed easily with pip. Enjoy!

Post-Specific Resources in Jekyll

11 March ’17

I set up this blog on Jekyll last year (largely just to have a repository for my 538 Riddler solutions haha). I really like Jekyll because it is simple, supports Markdown and Liquid templating, and can be hosted for free on GitHub. However, I did recently notice that my site was getting a little bloated. Pages were taking a long time to load because I was always loading a ton of JavaScript resources (jQuery, Katex, d3, and plotly.js), even though not every post uses all of them. I recently implemented some logic that loads only the required resources for each page/post. Here's a quick overview.

1. Add Include For Each Resource

First, we need to make an include for each resource. For some resources like plotly, this is just a tag referencing the JS library and/or CSS stylesheets. For other resources like Katex, the include also contains some additional scripting to initialize the resource.

plotly.html

<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>

2. Add Resources Array to Post YAML Front Matter

For each post, I added an array called 'resources' to the YAML front matter. This array holds a list of all the resources that the post requires.

---
layout: post
title: 'Test'
date:   1970-01-01
tags: [lorem, ipsum]
permalink: /test
resources: [plotly]
---
...

3. Add Include for Loading Resource If Required

Next, I made another include which adds a resource if it is required. For paginator pages, we need to loop through each post on the page. For individual post pages, we obviously just check that page.

add_resource.html

{% assign add_resource = false %}
{% if paginator.posts %}
  {% for post in paginator.posts %}
    {% if post.resources contains resource %}
      {% assign add_resource = true %}
      {% break %}
    {% endif %}
  {% endfor %}
{% endif %}
{% if page.resources contains resource %}
  {% assign add_resource = true %}
{% endif %}
{% if add_resource %}
  {% include {{ resource | append:'.html' }} %}
{% endif %}

4. Add Logic to Check Each Resource

Finally, I added a little more Liquid logic to the head tag (which is in a head.html include in my case) to check each resource. In your _config.yml file, add an array called 'resources' which holds a list of all available resources. Names should match the file names of the includes for each resource.

_config.yml

...
resources: [katex, plotly, jquery, d3]

head.html

<head>
  ...
  {% for resource in site.resources %}
    {% include add_resource.html resource=resource %}
  {% endfor %}
</head>

And that's it! You can see this implemented in my Jekyll site here. Enjoy.