New Algorithms for Similarity Search in High Dimensions

Thursday, September 15, 2016 - 10:00am - 10:50am
Keller 3-180
Piotr Indyk (Massachusetts Institute of Technology)
Similarity search is a fundamental computational task that involves searching for similar items in a large collection of objects. This task is often formulated as the nearest neighbor problem: given a database of n points in a d-dimensional space, devise a data structure that, given any query point, quickly finds its nearest neighbor in the database. The problem has a remarkable number of applications in a variety of fields, such as machine learning, databases, natural language processing and computer vision. Many of those applications involve data sets that are very high dimensional. Unfortunately, all known algorithms for this problem require query time or data structure size that are exponential in the dimension, which makes them inefficient when the dimension is high enough . As a result, over the last two decades there has been a considerable effort focused on developing approximate algorithms that can overcome this curse of dimensionality.

A popular framework for designing such algorithms is Locality Sensitive Hashing (LSH). It relies on the existence of efficiently computable random mappings (LSH functions) with the property that the probability of collision between two points is related to the distance between them. The framework is applicable to a wide range of distances and similarity functions, including the Euclidean distance. For the latter metric, it is known that the basic application of the LSH function yields e.g., a 2-approximate algorithm with a query time of roughly dn^(1/4), for a set of n points in a d-dimensional space.

In this talk I will describe recent data structures that offer significant improvements over the aforementioned bounds. The improvement is achieved by performing *data-dependent* hashing, which constructs the random hash functions in a way that depends on the data distribution. If time allows I will also describe a recent implementation of the core component in the aforementioned algorithms, which empirically outperforms widely used variants of LSH. The implementation is available at