Massive Parallelization to Learn from Massive Data

Tuesday, June 25, 2013 - 2:00pm - 3:30pm
Lind 305
Marc Suchard (University of California, Los Angeles)

Hands-on High-performance Statistical Computing Techniques

June 25, 2013 4:00 pm - 5:30 pm

Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases, such as claims databases or electronic health record systems, are attracting particular attention in this regard, but they present significant methodological and computational challenges. Likewise, fusing real-time satellite data with in situ sea surface temperature measurements for ecological modeling remains taxing for probabilistic spatial-temporal models on a global scale. In this talk, I discuss how high-performance statistical computation, including the use of graphics processing units, can enable complex inference methods in these massive datasets. I focus on algorithm restructuring through techniques like block relaxation (Gibbs, cyclic coordinate descent, MM) to exploit increased data/parameter conditional independence within traditional serial structures. I find orders-of-magnitude improvements in overall run-time when fitting models involving tens of millions of observations.
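As a rough illustration of the restructuring idea, the sketch below runs cyclic coordinate descent on an ordinary least-squares problem. It is not the speaker's implementation: the least-squares objective, the function name `cyclic_coordinate_descent`, and the simulated data are all assumptions chosen for brevity. The point is only that the outer loop over coordinates stays serial, while each update reduces to sums over observations whose terms are independent, which is the kind of work a GPU could take on.

```python
import numpy as np

def cyclic_coordinate_descent(X, y, n_sweeps=200):
    """Least-squares fit by cyclic coordinate descent (illustrative only)."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta          # kept up to date after every update
    col_sq = (X ** 2).sum(axis=0)    # per-column squared norms, computed once
    for _ in range(n_sweeps):
        for j in range(p):           # serial structure: one coordinate at a time
            # This inner product is a sum over observations whose terms do not
            # depend on each other, so it is the part that can run in parallel.
            delta = X[:, j] @ residual / col_sq[j]
            beta[j] += delta
            # Elementwise residual update, also independent across observations.
            residual -= delta * X[:, j]
    return beta

# Simulated data (hypothetical, for demonstration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_beta + 0.1 * rng.normal(size=500)

beta_hat = cyclic_coordinate_descent(X, y)
```

Here NumPy's vectorized operations stand in for the data-parallel step; on the scale described in the abstract, the same per-observation sums would presumably be dispatched to many-core hardware instead.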

These approaches apply widely to high-dimensional biological problems modeled through stochastic processes. To drive this point home, I conclude with a seemingly unrelated example: developing nonparametric models to study the genomic evolution of infectious diseases. These infinite hidden Markov models (HMMs) generalize both Dirichlet process mixtures and the usual finite-state HMM to capture unknown heterogeneity in the evolutionary process. Data squashing strategies, coupled with massive parallelization, yield novel algorithms that finally bring these flexible models within our grasp. [Joint work with Subha Guha, Ricardo Lemos, David Madigan, Bruno Sanso and Steve Scott.]