Talk
Abstract:
Matrix Approximations for Large, Sparse Text Data Using Clustering
Inderjit
S. Dhillon
Department of Computer Sciences
UT Austin
inderjit@cs.utexas.edu
Large text collections warrant new, effective and efficient
methods that allow easy assimilation, retrieval and navigation.
Such methods include automatic clustering of documents into
conceptual categories, "intelligent" query retrieval, and classification
of documents into pre-defined categories.
Towards this end, I will present a geometric k-means like clustering
algorithm (after converting the text to a vector-space model)
to directly identify the main latent concepts in the document
collection. This clustering algorithm directly gives a partitioning
of the documents. Additionally, it yields a "concept-revealing"
linear subspace of the vector-space model, and a new matrix
decomposition of the term-document matrix, which we term "concept
decomposition". This linear subspace can be used for better
query retrieval in a manner analogous to the LSI (Latent Semantic
Indexing) method. We analytically and empirically investigate
the relationship between this new subspace and the invariant
subspaces of LSI. An advantage of our method is its computational
efficiency, especially when compared to the expensive eigenvector
computations required in LSI. I will present experimental results
that validate these claims. If time permits, I will also present
a scheme to "visualize" proximity relationships among the text
documents.
Back to Workshop Schedule
Back to IMA "HOT TOPICS" Workshop: Text Mining
|