# Randomized Algorithms for Scalable Data Analysis

Friday, June 21, 2013 - 2:00pm - 3:30pm

Lind 305

Alexander Smola (Carnegie Mellon University)

In this talk I will give an overview of a number of projection-based randomized algorithms for scalable data analysis. While randomized methods are already in frequent use, e.g. for Bayesian inference, their potential for designing function classes and for obtaining approximate solutions to expensive estimation problems holds significant promise. Based on a number of vignettes I will discuss basic principles and recent developments:

* The Bloom filter is an efficient data structure for approximate set membership that achieves strong accuracy guarantees (zero false negatives and only a small fraction of false positives) at minimal memory overhead. Extending it to floating-point counts yields the Count (Farach et al., 2003) and Count-Min (Cormode and Muthukrishnan, 2005) sketches, which can be used for frequency counts (a code sketch follows the list).

* The above methods are very useful for designing memory-efficient linear classifiers, leading to the class of hash kernels for documents and other sparse feature collections (Weinberger et al., 2009); see the feature-hashing sketch below.

* Applying the same techniques to matrices yields both fast matrix multiplication algorithms (Pagh, 2012) and collaborative filtering algorithms (Karatzoglou et al., 2010); a simplified sketch follows the list.

* Shingles are an efficient tool for extracting random objects from a set (Broder et al., 1997) in such a way as to provide fingerprints for fast document comparison. This idea can be adapted to linear classification in the form of Conditional Random Sampling (Li et al., 2006); a MinHash sketch is given below.

* Random projections can be used for drastic dimensionality reduction. In the context of vectors this leads to locality-sensitive hashing (Indyk and Motwani, 1998), used e.g. for nearest-neighbor retrieval. In the context of matrices it leads to fast randomized low-rank approximation techniques in numerical linear algebra (Halko et al., 2010); see the range-finder sketch after the list.

* Binary variants of such projections were studied by Goemans and Williamson (1998) and by Charikar (2005). They allow for fast reconstruction of angles between vectors. These techniques can be employed in Bayesian inference when computing inner products for exponential families (Ahmed et al., 2012); a sign-projection sketch follows the list.

* For nonlinear function classes, related techniques were proposed by Neal (1994) and by Rahimi and Recht (2008) in the form of Random Kitchen Sinks. Random draws allow one to approximate many kernel functions quite efficiently. A recent paper by Le et al. (2013) shows how the memory footprint and computation can be reduced significantly; a random-features sketch closes the list below.
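
For the first vignette, here is a minimal Count-Min sketch in Python. The width, depth, and seed-based hashing below are arbitrary choices for illustration; this is a sketch of the idea, not a reference implementation of Cormode and Muthukrishnan's algorithm.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: returns an overestimate of each item's count,
    with the overestimate bounded with high probability (width/depth trade off
    memory against accuracy)."""

    def __init__(self, width=2048, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # One random seed per row stands in for a family of independent hash functions.
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # The minimum over rows is an upper bound on the true count.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

# Usage: approximate word frequencies in a toy stream.
cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a", "b"]:
    cms.update(word)
print(cms.query("a"))  # >= 3, typically exactly 3
```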
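
The hash-kernel idea for documents (Weinberger et al., 2009) can be sketched as follows. The dimension and the two auxiliary hashes are assumptions made for the example; the signed hashing keeps the hashed inner product an unbiased estimate of the original one.

```python
import numpy as np

def hashed_features(tokens, dim=2**18):
    """Feature hashing ('hash kernel'): map a bag of tokens to a fixed-length
    vector.  A second hash supplies a random sign, which makes the hashed
    inner product an unbiased estimate of the original inner product."""
    x = np.zeros(dim)
    for tok in tokens:
        idx = hash(("index", tok)) % dim
        sign = 1.0 if hash(("sign", tok)) % 2 == 0 else -1.0
        x[idx] += sign
    return x

# A linear classifier can now be trained on fixed-length vectors
# regardless of vocabulary size.
doc1 = "randomized algorithms for scalable data analysis".split()
doc2 = "scalable randomized data analysis".split()
print(hashed_features(doc1) @ hashed_features(doc2))  # approximates the token overlap
```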
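
Pagh's 2012 algorithm compresses the outer products with a count sketch and an FFT; the snippet below shows only a simpler random-sign-projection estimator of a matrix product, which conveys the same "sketch first, then multiply small matrices" idea but is not Pagh's method. Matrix sizes and the sketch dimension are arbitrary.

```python
import numpy as np

def approx_matmul(A, B, k=256, seed=0):
    """Estimate A @ B by compressing the shared inner dimension n down to k with
    a random sign matrix Omega (k x n).  Since E[Omega.T @ Omega] = k * I, the
    product of the two small sketches is an unbiased estimate of A @ B, with
    error decaying like 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    Omega = rng.choice([-1.0, 1.0], size=(k, A.shape[1]))
    return (A @ Omega.T) @ (Omega @ B) / k

# Toy example: two matrices sharing structure along the (large) inner dimension.
rng = np.random.default_rng(1)
V = rng.standard_normal((10000, 5))
A = rng.standard_normal((50, 5)) @ V.T   # 50 x 10000
B = V @ rng.standard_normal((5, 40))     # 10000 x 40
err = np.linalg.norm(approx_matmul(A, B, k=500) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {err:.3f}")  # shrinks as k grows
```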
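
The shingling vignette (Broder et al., 1997) can be illustrated with a small MinHash fingerprint. The shingle length and the number of hash functions are arbitrary choices for the example.

```python
def shingles(text, k=4):
    """Character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """MinHash fingerprint: one minimum per random hash function.  The fraction
    of matching signature positions estimates the Jaccard similarity of the sets."""
    return [min(hash((seed, s)) for s in shingle_set) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("randomized algorithms for scalable data analysis"))
b = minhash_signature(shingles("randomized methods for scalable data analysis"))
print(estimated_jaccard(a, b))  # close to the true Jaccard similarity of the shingle sets
```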
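
For the matrix half of the random-projection vignette, a minimal randomized range finder in the spirit of Halko et al. (2010) looks roughly as follows. The target rank and oversampling are arbitrary, and power iterations are omitted for brevity.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=10, seed=0):
    """Randomized range finder: sample the range of A with a Gaussian test
    matrix, orthonormalize, and project.  Returns Q, B with A ~= Q @ B."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)   # orthonormal basis for an approximate range of A
    B = Q.T @ A                      # small (rank + oversample) x n matrix
    return Q, B

rng = np.random.default_rng(2)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))  # exactly rank 20
Q, B = randomized_low_rank(A, rank=20)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))  # near zero for a rank-20 matrix
```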
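
The binary-projection vignette can be sketched with random hyperplane signs: for Gaussian projection directions, the probability that two vectors land on opposite sides of a random hyperplane equals their angle divided by pi, so the fraction of disagreeing sign bits estimates the angle. The number of projections below is an arbitrary choice.

```python
import numpy as np

def sign_projection(x, Omega):
    """Binary random projection: keep only the signs of <omega_i, x>."""
    return np.sign(Omega @ x)

def estimated_angle(bits_a, bits_b):
    """P[sign(<omega,a>) != sign(<omega,b>)] = angle(a, b) / pi for Gaussian omega,
    so the fraction of disagreeing bits times pi estimates the angle."""
    return np.pi * np.mean(bits_a != bits_b)

rng = np.random.default_rng(3)
d, m = 100, 5000
Omega = rng.standard_normal((m, d))
a, b = rng.standard_normal(d), rng.standard_normal(d)
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(true_angle, estimated_angle(sign_projection(a, Omega), sign_projection(b, Omega)))
```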
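
Finally, Random Kitchen Sinks can be sketched with random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2): the inner product of the random feature maps approximates the kernel. The number of features and the unit bandwidth are arbitrary assumptions; the Fastfood construction of Le et al. (2013), which replaces the dense Gaussian matrix with structured Hadamard-based matrices to cut memory and computation, is not shown here.

```python
import numpy as np

def random_fourier_features(X, num_features=2000, seed=0):
    """Random Kitchen Sinks for the Gaussian kernel with unit bandwidth:
    z(x) = sqrt(2/m) * cos(W x + b) with W ~ N(0, I) and b ~ Uniform[0, 2*pi],
    so that z(x) . z(y) approximates exp(-||x - y||^2 / 2)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W.T + b)

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 10))
Z = random_fourier_features(X)
exact = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / 2.0)
print(np.max(np.abs(Z @ Z.T - exact)))  # small approximation error
```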


MSC Code: 68W20