Jonathan Sedar - Hierarchical Bayesian Modelling with PyMC3 and PySTAN

PyData

PyData London 2016

Can we use Bayesian inference to detect unusual car emissions test results for Volkswagen? In this worked example, I'll demonstrate hierarchical linear regression using both PyMC3 and PySTAN, and compare the flexibility and modelling strengths of each framework.

Overview

Bayesian inference bridges the gap between white-box model introspection and black-box predictive performance. We gain the ability to fully specify a model and fit it to observed data according to our prior knowledge. Small datasets are handled well, and the overall method and results are very intuitive, lending themselves to both statistical insight and future prediction.

This talk will demonstrate the use of Bayesian inference in a real-world scenario: using a set of hierarchical models to compare exhaust emissions data from a set of vehicle manufacturers.

This will be interesting to people who work in the Type A side of data science, and will demonstrate usage of the tools as well as some theory.

The Frameworks

PyMC3 and PySTAN are two of the leading frameworks for Bayesian inference in Python, offering concise model specification, MCMC sampling, and a growing number of built-in conveniences for model validation, verification and prediction.

PyMC3 is the successor to PyMC2, and comprises a comprehensive package of symbolic statistical modelling syntax and very efficient gradient-based samplers, using the Theano library (of deep-learning fame) for gradient computation. Of particular interest is that it includes the No-U-Turn Sampler (NUTS), developed recently by Hoffman & Gelman (2014), which is otherwise only available in STAN.
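To make "gradient-based sampler" concrete, here is a minimal sketch of plain Hamiltonian Monte Carlo (HMC), the method that NUTS extends by choosing the trajectory length automatically. This is not PyMC3 or STAN code and it is not NUTS itself: the target is assumed to be a one-dimensional standard normal, the gradient is written by hand (where PyMC3 would obtain it from Theano), and the function names, step size and step count are all illustrative choices.

```python
import math
import random


def grad_potential(q):
    """Gradient of the potential U(q) = q^2 / 2 (negative log-density of N(0, 1))."""
    return q


def hamiltonian(q, p):
    """Total energy: potential q^2 / 2 plus kinetic p^2 / 2."""
    return 0.5 * q * q + 0.5 * p * p


def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p -= 0.5 * step_size * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p                        # full step for position
        if i < n_steps - 1:
            p -= step_size * grad_potential(q)    # full step for momentum
    p -= 0.5 * step_size * grad_potential(q)      # final half step for momentum
    return q, p


def hmc_sample(n_samples, step_size=0.2, n_steps=10, seed=0):
    """Draw samples from N(0, 1) with fixed-length HMC (illustrative settings)."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                   # resample the momentum
        q_new, p_new = leapfrog(q, p, step_size, n_steps)
        # Metropolis correction based on the change in total energy
        log_accept = hamiltonian(q, p) - hamiltonian(q_new, p_new)
        if rng.random() < math.exp(min(0.0, log_accept)):
            q = q_new
        samples.append(q)
    return samples
```

The gradient steers each proposal along the density, so proposals travel far yet are usually accepted. Fixed-length HMC needs `n_steps` tuned by hand, which is the tuning burden NUTS is designed to remove.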

PySTAN is a wrapper around STAN, a major open-source framework for Bayesian inference developed by Gelman, Carpenter, Hoffman and many others. STAN also has HMC and NUTS samplers and, more recently, Variational Inference, which is a very efficient way to approximate the joint probability distribution. Models are specified in a custom syntax and compiled to C++.

The Real-World Problem & Dataset

I'm currently quite interested in road traffic and vehicle insurance, so I've dug into the UK VCA (Vehicle Type Approval) to find their Car Fuel and Emissions Information for August 2015. The raw dataset is available for direct download and is small but varied enough for our use here: roughly 2500 cars and 10 features, including the hierarchy of parent manufacturer, manufacturer and model.
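The manufacturer hierarchy is what makes a hierarchical model useful here: a manufacturer with few cars borrows strength from the overall population. The sketch below shows that partial-pooling idea in closed form, assuming a normal-normal model with known standard deviations. It is not the model from the talk, which estimates everything with MCMC. The function name, the group names and every number are made up purely for illustration and are not taken from the VCA dataset.

```python
def partial_pool(group_values, grand_mean, within_sd, between_sd):
    """Posterior mean of a group's true mean under a normal-normal model.

    Assumes known within-group and between-group standard deviations,
    which a full hierarchical model would estimate from the data instead.
    """
    n = len(group_values)
    group_mean = sum(group_values) / n
    data_precision = n / within_sd ** 2          # information from this group's cars
    prior_precision = 1.0 / between_sd ** 2      # information from the population
    weight = data_precision / (data_precision + prior_precision)
    return weight * group_mean + (1.0 - weight) * grand_mean


# Hypothetical emissions values (g/km), invented for this example only.
grand_mean = 130.0
small_maker = [170.0, 190.0]      # 2 cars, raw mean 180
large_maker = [180.0] * 40        # 40 cars, raw mean 180

est_small = partial_pool(small_maker, grand_mean, within_sd=20.0, between_sd=10.0)
est_large = partial_pool(large_maker, grand_mean, within_sd=20.0, between_sd=10.0)
print(round(est_small, 2), round(est_large, 2))
```

Both groups have the same raw mean, but the two-car group is pulled much further towards the population mean than the forty-car group. This shrinkage helps stop a sparsely observed manufacturer from looking unusual by chance alone.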

I will investigate the car emissions data from the point of view of the Volkswagen Emissions Scandal, which seems to have meaningfully damaged their sales. Perhaps we can find unusual results in the emissions data for Volkswagen.

GitHub repo: https://github.com/jonsedar/pymc3_vs_...