Bayesian modeling in your favorite language(s)

When I was in college, the division between Bayesian and frequentist statistics seemed significant (pun maybe intended). Perhaps it seemed that way because of this 2012 book or this 2014 NYTimes article? Interestingly, for just as long, there have been rebuttals arguing that a “Bayesian vs. frequentist” mindset is neither productive nor practical. I generally agree with these rebuttals: you should understand the data and use the right tool for the job. And speaking of tools, here are some ways you can get Bayesian on your favorite dataset:

Stan (C++)

The classic software for Bayesian statistics is Stan. Though written in C++, it has a wide range of language interfaces, including Python and R (more below). The documentation is excellent, and there is a collection of case studies to get beginners started.

R

Since R is a statistical language, RStan is logical and inevitable. In fact, the case studies above are dominated by RStan entries. The popular Doing Bayesian Data Analysis book is now in its second edition, though it’s by no means the only RStan book.

Python

The newer, sexier way to do Bayesian modeling in Python is PyMC3, which feels more Pythonic. It is covered by the excellent book Bayesian Methods for Hackers, which has interesting examples.

Since my phone has been tracking my running (via RunKeeper) and biking (via Strava), I thought these might be good real datasets to play around with. It was fairly straightforward to extract the XML from the iPhone’s Health app, parse it, and save it as a dataset. From there it wasn’t too hard to model some distributions using the PyMC3 package: an exponential distribution for each λ, representing the step count before and after a hypothesized shift in activity level, and τ for the actual time of the shift. However, if I only wanted to detect when a change occurred in the step count data, building a full model is overkill. It’s better to simply use time-series methods like changepoint detection.
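For the simpler goal of just locating the shift, a minimal changepoint sketch needs only the standard library. This is not the PyMC3 model described above: it assumes a single shift and Poisson-distributed counts, and it picks the split point that maximizes the likelihood. The names `poisson_loglik` and `find_changepoint`, and the toy `steps` values, are made up for illustration.

```python
import math


def poisson_loglik(counts):
    """Poisson log-likelihood of counts at their own mean rate.

    The log(k!) terms are dropped: their total is the same for every
    split point, so they do not affect which split wins.
    """
    n = len(counts)
    total = sum(counts)
    if total == 0:
        return 0.0  # all-zero segment: rate 0 explains it perfectly
    rate = total / n
    return total * math.log(rate) - n * rate


def find_changepoint(counts):
    """Return the index tau (1 <= tau < len(counts)) where splitting the
    series into counts[:tau] and counts[tau:] gives the highest combined
    Poisson log-likelihood."""
    best_tau, best_ll = None, -math.inf
    for tau in range(1, len(counts)):
        ll = poisson_loglik(counts[:tau]) + poisson_loglik(counts[tau:])
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau


# Hypothetical daily step counts (in hundreds), not real Health app data.
# The activity level is constructed to jump starting at index 7.
steps = [40, 42, 38, 41, 39, 43, 40, 70, 72, 68, 71, 69, 73]
tau = find_changepoint(steps)
print("estimated changepoint index:", tau)
```

Unlike the Bayesian model, this gives a single point estimate for the shift with no uncertainty around it, which is often enough when the question is only "when did it change?"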

The next example in the book involves detecting students who cheat on tests. Now we’re delving a little deeper into PyMC, defining a Deterministic variable using a Theano tensor. PyMC underwent an API change in version 3, but the steps are still the same. This is a bigger project, so the results weren’t as instantaneous as in the previous project. But then again, I was running on my ancient Linux desktop without a GPU, so the Theano back-end was not performing optimally. My mid-2014 MacBook Pro performed only marginally better, with the added complication of missing header files in macOS Mojave.

Written on October 9, 2019