Dharmendra S. Modha
IBM Almaden
dmodha@almaden.ibm.com
Clustering separates unrelated documents and
groups related documents, and is useful for discrimination,
disambiguation, summarization, organization, and navigation
of unstructured collections of hypertext documents. We propose
a novel clustering algorithm that clusters hypertext documents
using words (contained in the document), out-links (from the
document), and in-links (to the document). The algorithm automatically
determines the relative importance of words, out-links, and
in-links for a given collection of hypertext documents. We
annotate each cluster using six information nuggets: summary,
breakthrough, review, keywords, citation,
and reference. These nuggets constitute high-quality
information resources that are representatives of the content
of the clusters, and are extremely effective in compactly
summarizing and navigating the collection of hypertext documents.
We employ web searching as an application to illustrate our
results. Anecdotally, when applied to the documents returned
by AltaVista in response to the query "abduction," our algorithm
separates documents about "alien abduction" from
those about "child abduction."
This is joint work with W. Scott Spangler and will appear in the ACM Hypertext 2000 Conference.
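
The abstract does not spell out how words, out-links, and in-links are combined, so the following is only a minimal sketch under stated assumptions: each document is represented by three sparse feature vectors, each vector is scaled to unit length, and similarity is a weighted sum of the three cosine similarities. The function names, the toy documents, and the fixed weights are all hypothetical; in particular, the automatic determination of the weights mentioned above is not shown here.

```python
import math

def normalize(features):
    """Scale a sparse feature vector (a dict) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    return {k: v / norm for k, v in features.items()} if norm > 0 else {}

def dot(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def weighted_similarity(doc_a, doc_b, weights):
    """Weighted sum of per-component cosine similarities.

    Assumption: `weights` maps each component name ("words", "out_links",
    "in_links") to a non-negative weight, and the weights sum to 1.
    """
    return sum(
        w * dot(normalize(doc_a[name]), normalize(doc_b[name]))
        for name, w in weights.items()
    )

# Hypothetical toy documents; the terms and link targets are made up.
ufo_1 = {
    "words": {"alien": 2, "abduction": 1, "ufo": 1},
    "out_links": {"ufo-site.example": 1},
    "in_links": {"sightings.example": 1},
}
ufo_2 = {
    "words": {"alien": 1, "abduction": 1},
    "out_links": {"ufo-site.example": 1},
    "in_links": {"sightings.example": 1},
}
child_1 = {
    "words": {"child": 2, "abduction": 1, "missing": 1},
    "out_links": {"police.example": 1},
    "in_links": {"parents.example": 1},
}

# Fixed illustrative weights (an assumption, not learned from data).
weights = {"words": 0.5, "out_links": 0.25, "in_links": 0.25}

same_topic = weighted_similarity(ufo_1, ufo_2, weights)
cross_topic = weighted_similarity(ufo_1, child_1, weights)
print(same_topic > cross_topic)
```

In this toy setting the two "alien abduction" pages share both vocabulary and links, so they score as more similar to each other than either does to the "child abduction" page, even though all three contain the query word. A clustering procedure such as k-means could then group documents using a similarity of this kind.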
Material from IMA Talk: slides (pdf, postscript); paper (pdf, postscript)