A Blog About D4T4 & M47H

Analyzing Citi Bike Data w/ BigQuery

06 March ’17

I've recently started using Google Cloud Platform for some of my big data analyses. In particular, I have been playing with BigQuery. Unlike AWS Redshift, BigQuery is a fully elastic, multi-tenant database. It is very easy to setup and gives you essentially infinite scale! I decided to take BigQuery for a spin with an analysis that I've always been interested in doing: understanding how Citi Bike re-balances its inventory.

I started by downloading one year's worth of Citi Bike trips, about 14M trips and 2.4GB. The Google SDK is a great command-line resource that you can use to interact with resources on GCP. Using the Google SDK, I uploaded this data to a new Google Storage bucket (gsutil), then loaded it into a new schema in BigQuery (bq). From there, I wrote a few queries to determine station-level inflows/outputs and re-balancing between stations, then dumped these newly created tables back into Google Storage. Here is the entire workflow:

# download / extract data
for i in $(seq -f "%02g" 1 12)
    echo "Downloading 2016-$i trips"
    wget --directory-prefix=data 2016$$
    echo "Unzipping 2016-$i trips"
    unzip ./data/2016$ -d data

# upload to storage
gsutil mb gs://citi-bike-trips
gsutil -o GSUtil:parallel_composite_upload_threshold=50M -m cp ./data/*.csv gs://citi-bike-trips

# put into bigquery
bq mk trips
read -d '' schema <<- EOF
schema=`echo $schema | tr -d '[[:space:]]'`
bq load --skip_leading_rows=1 trips.trips gs://citi-bike-trips/*.csv $schema

# format start time to timestamp (tricky because multiple formats)
bq query --use_legacy_sql=false --destination_table=trips.trips2 "
  SELECT *, TIMESTAMP(PARSE_DATETIME('%m/%d/%Y %H:%M:%S', starttime)) AS starttime_ts
  FROM trips.trips
  WHERE REGEXP_CONTAINS(starttime, r'^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}')
  SELECT *, TIMESTAMP(starttime) AS starttime_ts
  FROM trips.trips
  WHERE NOT REGEXP_CONTAINS(starttime, r'^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}')"

# pull station-to-station rebalances
bq query --use_legacy_sql=false --destination_table=trips.rebalances "
  SELECT last_end_station_id AS from_station, start_station_id AS to_station, COUNT(*) AS n FROM (
    SELECT bikeid, start_station_id, end_station_id, starttime_ts,
      LAG(end_station_id, 1) OVER (PARTITION BY bikeid ORDER BY starttime_ts ASC) last_end_station_id
    FROM trips.trips2
    ORDER BY bikeid, starttime_ts ASC
  ) AS x
  WHERE start_station_id != last_end_station_id
  GROUP BY 1,2

# pull list of stations with inflows and outflows
bq query --use_legacy_sql=false --destination_table=trips.stations "
  SELECT id, name, lat, lon, SUM(outflow) AS outflow, SUM(inflow) AS inflow
  FROM (
    SELECT id,
      FIRST_VALUE(name) OVER (PARTITION BY id ORDER BY starttime_ts DESC) AS name,
      FIRST_VALUE(lat) OVER (PARTITION BY id ORDER BY starttime_ts DESC) AS lat,
      FIRST_VALUE(lon) OVER (PARTITION BY id ORDER BY starttime_ts DESC) AS lon,
      outflow, inflow
    FROM (
      SELECT start_station_id AS id, start_station_name AS name, start_station_latitude AS lat, start_station_longitude AS lon,
        COUNT(*) AS outflow, 0 AS inflow, MAX(starttime_ts) AS starttime_ts
      FROM trips.trips2 GROUP BY 1,2,3,4
      SELECT end_station_id AS id, end_station_name AS name, end_station_latitude AS lat, end_station_longitude AS lon,
        0 AS outflow, COUNT(*) AS inflow, MAX(starttime_ts) AS starttime_ts
      FROM trips.trips2 GROUP BY 1,2,3,4
    ) AS x
  ) AS x
  GROUP BY 1,2,3,4"

# download two new tables
bq extract trips.rebalances gs://citi-bike-trips/outputs/rebalances.csv
bq extract trips.stations gs://citi-bike-trips/outputs/stations.csv
gsutil cp gs://citi-bike-trips/outputs/*.csv ./data/outputs/

Once downloaded, I used R to create some visualizations. Specifically, the ggmap package, which you can use to download static maps from the Google Maps API and plot data a-la-ggplot2 on top of them. Here's a plot showing which stations are net inflows vs. net outflows for bikes:

And here's another plot showing the top 100 station-to-station rebalances:

Downtown neighborhoods (e.g. East/West Village) tend to be net inflows, presumably because they are frequent leisure destinations for New Yorkers. Uptown neighborhoods (e.g. Upper East/West Side) tend to be net outflows, presumably because they are more residential. Midtown is a mixed bag. On one hand, it has major transit hubs like Penn Station, Port Authority, and Grand Central which are obviously big net outflows. However, Midtown is also a major commercial center, so I imagine a lot of people commute to work via Citi Bike.

Overall, I found BigQuery very easy to use and very fast! All of my queries ran in under 10 seconds. Previously, BigQuery only supported a non-standard, BigQuery-specific SQL dialect. However, it now supports standard SQL, including uber-useful analytic/window functions. And best of all, it's very affordable. This dataset costs ~$0.11 per month to host ($0.026/GB/month on Google Storage and $0.02/GB/month on BigQuery). Unlike AWS, you don't have to set up an instance and pay for uptime; you pay based on how many bytes you process with your queries ($5/TB) making it very convenient for adhoc/exploratory analytics. If we assume each of the above 3 queries scanned my entire 2.4GB dataset, this would cost 3*2.4/1024*$5 ~ $0.04. Except the first TB processed each month is free! Plus Google gives you a $300 credit when you sign up. All of my code is posted on my GitHub. Enjoy!

Building an Optimal Portfolio of ETFs

26 February ’17

Exchange traded funds (ETFs) have taken the market by storm. Over the last few years, we’ve seen a huge shift in assets towards passive investing, motivated by ETF’s low fee structure and the revelation that most active managers cannot beat their benchmark. This shouldn't be terribly surprising. It only takes simple arithmetic to demonstrate that active management is a zero-sum before fees and a losing proposition after fees. Even world reknown active investor Warren Buffet has suggested a simple portfolio of inexpensive index funds for the heirs to his own fortune.

However, just because ETFs are themselves portfolios doesn't mean that we don't need to think about portfolio optimization. There are well-known factors which earn market-adjusted positive returns (e.g. small > large, value > growth, high momentum > low momentum). Can we build a portfolio of ETFs that takes advantage of these tilts?


A well-constructed portfolio of ETFs can give you similar return with less risk and less market exposure than a single, market-mirroring ETF. Benchmarks for my analysis are the SPY and the MDY.


  1. “Investible” ETF maintains a great database of over ~1800 ETFs. From this, I restricted my universe to funds that are US-based, are non-sector focused, are highly liquid (334 ETFs) and have at least 5 years of returns (211/334).
  2. Factor Modeling – I regressed ETF returns (Yahoo Finance) against known factors including market returns, size, value, momentum, profitability, investment, variance, net shares issued, and accruals. Factor data generously provided by Kenneth French (the CRSP database license is a little outside my price range). I used L1 regularization to prevent overfitting factor loadings. From this, I was able to calculate each ETF’s expected return/variance and covariances with other ETFs.
  3. Portfolio Optimization – I made portfolios with target results of 6%, 8%, 10% and 12%. I used a quadratic optimizer to minimize variance within constraints. Minimum asset weights of 2.5%. Also, to ensure that a variety of factors were driving returns, I required positive tilts on size, value, momentum, profitability, and investment factors.


Overall, all 4 of my portfolios generated higher Sharpe ratios (1.6-1.8 vs. 1.1-1.3) and lower draw-downs (2%-5% vs. 9%-13%) but lower returns than the SPY and MDY, which both have returned a staggering 13.2% annually over the last 5 years. This isn't totally surprising. The model, to the extent that it can, tries to balance systematic market risk and factor risk, resulting in lower betas. Normally, this would be a good thing, except our 5 year test period sits squarely in the middle of the second longest bull market ever. I think we can expect more modest returns for the benchmarks moving forward; my model projects 8.1% and 8.9% for the SPY and MDY respectively. Expected Sharpe ratios for my portfolios (1.4-1.5) are nearly 3x higher than the benchmarks (0.5)! My target return 12% portfolio (TR12) has a beta of just 0.51 and positive loadings on size, profitability, investment, momentum, and accrual factors.

Portfolio Performance Results:

ExpectedActual (Last 5 Years)
PortfolioReturnSDSharpeReturnSDSharpeMax Draw Down

Portfolio Asset Weights:

First Trust Dow Jones Select MicroCap Index Fund (FDM)12.9%18.1%24.4%23.5%
Vanguard Long-Term Corporate Bond Index Fund (VCLT)12.3%18.1%22.4%22.7%
PowerShares Financial Preferred Portfolio (PGF)17.6%24.8%27.4%0.0%
PowerShares High Yield Equity Dividend Achievers Portfolio (PEY)6.1%8.6%12.0%33.5%
iShares Core 10+ Year USD Bond ETF (ILTB)5.9%7.9%13.7%20.3%
Vanguard Short-Term Corporate Bond Index Fund (VCSH)19.2%22.6%0.0%0.0%
PIMCO Enhanced Short Maturity Active ETF (MINT)26.0%0.0%0.0%0.0%

All of my code is posted on my GitHub. The universe of ETFs analyzed and their tilts can be downloaded here and here. Cheers!

538 Riddler: 100-Sided Die

14 January ’17

This week's Riddler involves a game played with a 100-sided die (I seriously want one). I started by thinking about the problem as an absorbing Markov Chain with 101 states, 1 state representing the end of the game and 100 states for each potential previous roll. The transition matrix is the following:

We break this transition matrix into three components: transient-to-transient (Q), transient-to-absorbing (R), and absorbing-to-absorbing (the identity matrix by definition). Q here is simply the above matrix minus the last row and the last column (since we have just 1 absorbing state). The expected number of rolls before being absorbed when starting at each transient state is the following vector:

The expected number of rolls for the game is simply the average of the values in this vector plus 1, since we're equally likely to start at any one of these initial rolls. A little R code gives us the answer:

Q <- matrix(0, ncol=100, nrow=100)
Q[upper.tri(Q,diag=TRUE)] <- 1/100
N <- solve(diag(100) - Q)
t <- N %*% matrix(1, nrow=100, ncol=1)
[1] 2.731999

Though this gets us to the answer, it's tough to extend this approach to the general N case. Let represent the expected number of rolls until the game ends given that the previous roll was . We can develop some recurrence relations starting with and working backwards. Iterative substitution gives us an expression for :

This is analagous to the vector that we computed above. Thus, the average of through plus 1 gives us the expected number of rolls for the game. And our answer here is consistent with the absorbing Markov Chain approach above. We can also extend this logic to derive an expression for the N case. Interestingly, as N goes to infinity, the expected number of rolls converges on e!