Team 3: Fast nearest neighbor search in massive high-dimensional sparse data sets

Wednesday, August 3, 2011 - 10:20am - 10:40am
Keller 3-180
Sanjiv Kumar (Google Inc.)
Project Description:


Driven by rapid advances in many fields, including biology, finance, and web services, applications involving millions or even billions of data items, such as documents, user records, reviews, images, or videos, are now common. Given a query from a user, fast and accurate retrieval of relevant items from such massive data sets is of critical importance. Each item in a data set is typically represented by a feature vector, possibly in a very high-dimensional space. Moreover, such a vector tends to be sparse in many applications. For instance, a text document is encoded as a word-frequency vector. Similarly, images and videos are commonly represented as sparse histograms of a large number of visual features.

Many techniques have been proposed for fast nearest neighbor search. Most of them fall into two paradigms: specialized data structures (e.g., trees) and hashing (representing each item as a compact code). Tree-based methods scale poorly with dimensionality, typically degrading to linear search in the worst case. Hashing-based methods are popular for large-scale search, but learning accurate and fast hashes for high-dimensional sparse data is still an open question.
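To make the sparse representation and the linear-search baseline concrete, here is a minimal sketch in Python. It assumes documents are encoded as word-count dictionaries and compared by cosine similarity; the function names and the toy documents are illustrative and not part of the project material.

```python
import math
from collections import Counter


def term_frequency(text):
    """Encode a document as a sparse word-frequency vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the vector with fewer nonzero entries
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def linear_scan(query, database):
    """Exact nearest neighbor by brute force: one comparison per stored item."""
    return max(range(len(database)),
               key=lambda i: cosine_similarity(query, database[i]))


# Toy data (hypothetical), standing in for a much larger collection.
docs = [
    "fast nearest neighbor search",
    "hashing for sparse data",
    "stock prices in finance",
]
database = [term_frequency(d) for d in docs]
best = linear_scan(term_frequency("sparse data hashing"), database)
print(docs[best])
```

The cost of `linear_scan` grows linearly with the number of stored items, which is the behavior that tree-based and hashing-based methods try to avoid.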

In this project, we focus on fast approximate nearest neighbor search in massive databases by converting each item into a binary code. Locality Sensitive Hashing (LSH) is one of the most prominent methods; it uses randomized projections to generate simple hash functions. However, LSH usually requires long codes for good performance. The main challenge of this project is to learn appropriate hash functions that take the input data distribution into consideration. This will lead to more compact codes, significantly reducing storage and computational needs. The project will focus on understanding and implementing a few state-of-the-art hashing methods, developing the formulation for learning data-dependent hash functions assuming a known data density, and experimenting with medium- to large-scale data sets.
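As an illustration of hashing with randomized projections, the sketch below uses the random-hyperplane variant of LSH: each bit of the code records on which side of a random hyperplane the item falls. This is one common LSH family, chosen here as an assumption; the project description does not specify a variant. The names `make_hyperplanes`, `lsh_code`, and `hamming_distance`, the dimension, and the code length are all hypothetical. Sparse vectors are stored as index-to-value dictionaries so that only nonzero entries are touched.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplane normals in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(sparse_vec, hyperplanes):
    """Binary code with one bit per hyperplane: 1 if the projection is nonnegative."""
    code = 0
    for plane in hyperplanes:
        # Only the nonzero entries of the sparse vector contribute.
        projection = sum(value * plane[index]
                         for index, value in sparse_vec.items())
        code = (code << 1) | (1 if projection >= 0 else 0)
    return code


def hamming_distance(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


N_BITS, DIM = 32, 1000
planes = make_hyperplanes(N_BITS, DIM)

x = {3: 2.0, 17: 1.0, 256: 4.0}                 # sparse item
x_scaled = {k: 2.0 * v for k, v in x.items()}   # same direction, larger norm
y = {5: 1.0, 900: 2.0}                          # unrelated sparse item

code_x = lsh_code(x, planes)
code_x_scaled = lsh_code(x_scaled, planes)
code_y = lsh_code(y, planes)
print(hamming_distance(code_x, code_x_scaled), hamming_distance(code_x, code_y))
```

Because each bit depends only on the sign of a projection, vectors pointing in the same direction receive identical codes, and the expected Hamming distance grows with the angle between two vectors. The hyperplanes here are drawn without looking at the data, which is why long codes are typically needed; learning them from the data distribution is the direction the project proposes.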


Keywords: Approximate Nearest Neighbor (ANN) search, Hashing, LSH, Sparse data, High-dimensional hashing


For a quick overview of ANN search, review the following tutorials (more references are given at the end of the tutorials):




Prerequisites:

- Good computing skills (MATLAB or C/C++)

- Strong background in optimization, linear algebra and calculus

- Machine learning and computational geometry background preferred but not necessary