Randomized Algorithms for Scalable Data Analysis

Friday, June 21, 2013 - 2:00pm - 3:30pm
Lind 305
Alexander Smola (Carnegie-Mellon University)
In this talk I will give an overview over a number of projection based
randomized algorithms for scalable data analysis. While randomized methods
are frequently used, e.g. for Bayesian inference, their potential for
designing function classes, and for obtaining approximate solutions of
expensive estimation problem holds significant promise. Based on a number of
vignettes I will discuss basic principles and recent developments:

* The Bloom filter is an efficient structure to estimate set membership
that achieves high accuracy guarantees (zero false negatives and only a
small number of false positives) at minimal memory overhead. Extending
this to floating point numbers yields the Count (Farach et al, 2003) and
the CountMin (Cormode and Muthukrishnan, 2005) sketches that can be used
for frequency counts.

* The above methods are very useful when it comes to designing memory
efficient linear classifiers, leading to the class of Hash kernels for
documents, and other sparse feature collections (Weinberger et al.,

* Application of the same techniques to matrices yields both fast matrix
muliplication algorithms (Pagh 2012), and collaborative filtering
algorithms (Karatoglou et al., 2010).

* Shingles are an efficient tool for extracting random objects from a set
(Broder et al., 1997) in such a way as to provide fingerprints for fast
document comparison. This idea can be adapted to linear classification
in the form of Conditional Random Sampling (Li et al., 2006).

* Random projections can be used for drastic dimensionality reduction. In
the context of vectors this leads to locality sensitive hashing (Indyk
and Motwani, 1998) used e.g. for nearest neighbor retrieval. In the
context of matrices this leads to fast low-dimensional linear algebra
approximation techniques (Halko et al, 2010).

* Binary variants of such projections were studied by Goemans and
Williamson, 1998, and Charikar, 2005. They allow for fast reconstruction
of angles between vectors. These techniques can be employed in Bayesian
inference when computing inner products for exponential families (Ahmed
et al., 2012).

* For nonlinear function classes related techniques were proposed by Neal,
1994 and by Rahimi and Recht, 2008 in the form of Random Kitchen Sinks.
Random draws allow one to obtain approximations of many kernel functions
quite efficiently. A recent paper by Le et al, 2013 shows how memory
footprint and computation can be improved significantly.
MSC Code: