Content-Sensitive Fingerprinting and its Applications

Friday, April 6, 2001 - 10:10am - 11:00am
Vincent 570
Rafail Ostrovsky (Telcordia)
A standard notion of a hash-function is the one that maps all documents to short random-looking outputs. That is, even if two documents differ only in a few bits, a classical hash function is geared to output unrelated results. In many settings, there is a need for hash-functions which produce similar short outputs for similar documents. This is especially relevant when searching for a related documents in a large database, where one must find answers faster then searching through the entire database, with applications to information distillation and data mining algorithms.

In this talk, I will show how to construct such a hash function and explain the underling ideas of the algorithm. I will then show applications of such hash function to approximate searching, information distillation, data-mining algorithms, dimension-reduction techniques, clustering, filtering, approximate matching, facility location, nearest neighbor search and other approximation problems. The talk will be self-contained.