Matrix Approximations for Large, Sparse Text Data Using Clustering

Monday, April 17, 2000 - 11:15am - 11:40am
Keller 3-180
Inderjit Dhillon (The University of Texas at Austin)
Large text collections warrant new, effective and efficient methods that allow easy assimilation, retrieval and navigation. Such methods include automatic clustering of documents into conceptual categories, intelligent query retrieval, and classification of documents into pre-defined categories.

Towards this end, I will present a geometric k-means like clustering algorithm (after converting the text to a vector-space model) to directly identify the main latent concepts in the document collection. This clustering algorithm directly gives a partitioning of the documents. Additionally, it yields a concept-revealing linear subspace of the vector-space model, and a new matrix decomposition of the term-document matrix, which we term concept decomposition. This linear subspace can be used for better query retrieval in a manner analogous to the LSI (Latent Semantic Indexing) method. We analytically and empirically investigate the relationship between this new subspace and the invariant subspaces of LSI. An advantage of our method is its computational efficiency, especially when compared to the expensive eigenvector computations required in LSI. I will present experimental results that validate these claims. If time permits, I will also present a scheme to visualize proximity relationships among the text documents.