Main navigation | Main content

HOME » PROGRAMS/ACTIVITIES » Annual Thematic Program

PROGRAMS/ACTIVITIES

Annual Thematic Program »Postdoctoral Fellowships »Hot Topics and Special »Public Lectures »New Directions »PI Programs »Math Modeling »Seminars »Be an Organizer »Annual »Hot Topics »PI Summer »PI Conference »Applying to Participate »

Talk Abstract

Matrix Approximations for Large, Sparse Text Data Using Clustering

Matrix Approximations for Large, Sparse Text Data Using Clustering

Department of Computer Sciences

UT Austin

inderjit@cs.utexas.edu

Large text collections warrant new, effective and efficient methods that allow easy assimilation, retrieval and navigation. Such methods include automatic clustering of documents into conceptual categories, "intelligent" query retrieval, and classification of documents into pre-defined categories.

Towards this end, I will present a geometric k-means like clustering algorithm (after converting the text to a vector-space model) to directly identify the main latent concepts in the document collection. This clustering algorithm directly gives a partitioning of the documents. Additionally, it yields a "concept-revealing" linear subspace of the vector-space model, and a new matrix decomposition of the term-document matrix, which we term "concept decomposition". This linear subspace can be used for better query retrieval in a manner analogous to the LSI (Latent Semantic Indexing) method. We analytically and empirically investigate the relationship between this new subspace and the invariant subspaces of LSI. An advantage of our method is its computational efficiency, especially when compared to the expensive eigenvector computations required in LSI. I will present experimental results that validate these claims. If time permits, I will also present a scheme to "visualize" proximity relationships among the text documents.