Information retrieval models such as Latent Semantic Indexing (LSI) have been demonstrated to outperform lexical matching in the presence of synonymy and polysemy. Unfortunately, such rank-reduced models require the decomposition (e.g., singular value decomposition) of large term-by- document matrices which can prohibit scalable information retrieval. This particular work demonstrates how information filtering using level search techniques can reduce the computational cost of LSI. For each query, level search extracts a much smaller subset of the original term-by-document matrix, containing on average 25% of the original non-zero entries. When LSI is applied to such subsets, the average precision degrades only by 5% due to level search filtering. Moreover, for some document collections an increase in precision has also been observed. Further enhancement of level search can be based on a "pruning" scheme which deletes terms connected to only one document from the query-specific submatrix. Such pruning has achieved a 65% reduction (on average) in the number of non-zeros with a precision loss of no more than 5% for most collections.
This work is joint with Xiaoyan Zhang and Padma Raghavan, University of Tennessee.