June 17 - 28, 2013
This lecture will describe two applications of Bayesian graphic models.
Graphical Markov models use graphs with nodes
corresponding to random variables, and edges that
encode conditional independence relationships between
those variables. Directed graphical models (aka Bayesian
networks) in particular have received considerable attention.
This lecture will review basic concepts in graphical
model theory such as Markov properties, equivalence,
and connections with causal inference.
This lecture will review basic principles of Bayesian inference,
Bayesian hierarchical models, and the BUGS tool
for conducting Bayesian analyses.
The counterfactual approach to casual inference dates back at least to Neyman and has developed considerably in recent decades due to pioneering work by Rubin, Pearl, Robbins, and others. This lecture will introduce the counterfactual approach and discuss specific examples.
In this lecture, we will go beyond Friday's course on structured sparsity, and
consider more complex models. Recently, a large amount of research in
statistics and signal processing has been devoted to developing structured
sparse regularization functions. The goal is to encode some a priori knowledge
about an estimation problem in the regularization, in order to obtain better
prediction or better interpretability. Unfortunately, the literature on the
topic is is vast, and significantly different approaches are now refered to as
``structured sparsity''. We will present some aspects of this literature, and
focus particularly on convex approaches for structured sparse estimation.
References:
[1] F. Bach, R. Jenatton, J. Mairal and G. Obozinski. Structured Sparsity
through Convex Optimization. Statistical Science. 27(4). 2012
http://projecteuclid.org/euclid.ss/1356098550
We will discuss practical optimization algorithms for estimating sparse models,
both from a statistics and a signal processing point of view. First, we will
cover non-convex optimization techniques such as greedy algorithms and
DC-programming. Second, we will focus on convex optimization: first-order
methods, iterative reweighted least squares, and the homotopy method for the
Lasso.
References:
[1] Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization
with sparsity-inducing penalties.Foundations and Trends in Machine Learning,
4(1), 1--106, 2012.
http://lear.inrialpes.fr/people/mairal/resources/pdf/ftml.pdf
In this talk I will give an overview over a number of projection based
randomized algorithms for scalable data analysis. While randomized methods
are frequently used, e.g. for Bayesian inference, their potential for
designing function classes, and for obtaining approximate solutions of
expensive estimation problem holds significant promise. Based on a number of
vignettes I will discuss basic principles and recent developments:
* The Bloom filter is an efficient structure to estimate set membership
that achieves high accuracy guarantees (zero false negatives and only a
small number of false positives) at minimal memory overhead. Extending
this to floating point numbers yields the Count (Farach et al, 2003) and
the CountMin (Cormode and Muthukrishnan, 2005) sketches that can be used
for frequency counts.
* The above methods are very useful when it comes to designing memory
efficient linear classifiers, leading to the class of Hash kernels for
documents, and other sparse feature collections (Weinberger et al.,
2009).
* Application of the same techniques to matrices yields both fast matrix
muliplication algorithms (Pagh 2012), and collaborative filtering
algorithms (Karatoglou et al., 2010).
* Shingles are an efficient tool for extracting random objects from a set
(Broder et al., 1997) in such a way as to provide fingerprints for fast
document comparison. This idea can be adapted to linear classification
in the form of Conditional Random Sampling (Li et al., 2006).
* Random projections can be used for drastic dimensionality reduction. In
the context of vectors this leads to locality sensitive hashing (Indyk
and Motwani, 1998) used e.g. for nearest neighbor retrieval. In the
context of matrices this leads to fast low-dimensional linear algebra
approximation techniques (Halko et al, 2010).
* Binary variants of such projections were studied by Goemans and
Williamson, 1998, and Charikar, 2005. They allow for fast reconstruction
of angles between vectors. These techniques can be employed in Bayesian
inference when computing inner products for exponential families (Ahmed
et al., 2012).
* For nonlinear function classes related techniques were proposed by Neal,
1994 and by Rahimi and Recht, 2008 in the form of Random Kitchen Sinks.
Random draws allow one to obtain approximations of many kernel functions
quite efficiently. A recent paper by Le et al, 2013 shows how memory
footprint and computation can be improved significantly.
and
Hands-on High-performance Statistical Computing Techniques
June 25, 2013 4:00 pm - 5:30 pm
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. Likewise, fusion of real-time satellite data with in situ sea surface temperature measurements for ecological modeling remains taxing for probabilistic spatial-temporal models on a global scale. In this talk, I discuss how high-performance statistical computation, including graphics processing units, can enable complex inference methods in these massive datasets. I focus on algorithm restructuring through techniques like block relaxation (Gibbs, cyclic coordinate descent, MM) to exploit increased data/parameter conditional independence within traditional serial structures. I find orders-of-magnitude improvement in overall run-time fitting models involving tens of millions of observations.
These approaches are ubiquitous in high-dimensional biological problems modeled through stochastic processes. To drive this point home, I conclude with a seemingly unrelated example developing nonparametric models to study the genomic evolution of infectious diseases. These infinite hidden Markov models (HMMs) generalize both Dirichlet process mixtures and the usual finite-state HMM to capture unknown heterogeneity in the evolutionary process. Data squashing strategies, coupled with massive parallelization, yield novel algorithms that bring these flexible models finally within our grasp. [Joint work with Subha Guha, Ricardo Lemos, David Madigan, Bruno Sanso and Steve Scott]
This lectures will cover two related L2-penalized regularization methods: Ridge Regression and SVM, one from the 40's and one from the 90's. And SVM is one of the two most successful machine learning methods together with Boosting.
Boosting is one of the two most successful machine learning methods
with SVM. It uses gradient descent to an empirical loss function.
When the step sizes are small, it is computationally efficient way
to approximate Lasso. When a nuclear norm penalization is applied to L2
loss,
we have the low-rank regularization arising from the Netflix competition.
A subset of the netflix data will be investigated.
In this lecture, we will discuss basic experimental design principles in data collection and issues regarding data quality. Specific data examples such as the entron data set will be used.
This lecture will discuss variants and extenstions of Lasso
such as Lasso+LS, adaptive Lasso, and group Lasso.
LS and Maximum Likelihood estimation (MLE) overfit when the dimension of
the model is not small relative to the sample size. This happens almost
always in high-dimensions. Regularziation often works by adding a penalty
to the fitting criterion as in classical model selection methods such as
AIC or BIC and L1-penalized LS called Lasso. We will also introduce Cross-validation (CV) for regularization parameter selection.
This lecture will generalize LS to weighted LS (WLS) and use WLS
to connect with generalized linear models including logistic
regression. Remote sensing data for cloud detection will be used.
This lecture reviews least squares (LS) method for linear fitting
and its statistical properites under various linear regression
model assumptions. Methods will be illustrated with real data examples
from instructor's research projects.
This lecture will cover data summarization and visualization tools
such as kernel estimation, loess, scatterplot and dimension
reduction via principal component analysis (PCA). Specific
data examples will be used.
This lecture will give a heuristic overview of theoretical
results on Lasso that explains when and why Lasso and extensions work.
This lecture will illustrate the power of the sparse coding principle and low-rank regularization in modeling neuron responses to natural images in the very challenging visual cortex area V4.