Talk Abstract
Clustering Hypertext with Applications to Web Searching

Dharmendra S. Modha
IBM Almaden
dmodha@almaden.ibm.com


Clustering separates unrelated documents and groups related ones, and is useful for discrimination, disambiguation, summarization, organization, and navigation of unstructured collections of hypertext documents. We propose a novel algorithm that clusters hypertext documents using words (contained in the document), out-links (from the document), and in-links (to the document). The algorithm automatically determines the relative importance of words, out-links, and in-links for a given collection of hypertext documents. We annotate each cluster using six information nuggets: summary, breakthrough, review, keywords, citation, and reference. These nuggets constitute high-quality information resources that are representative of the content of the clusters, and are extremely effective in compactly summarizing and navigating the collection of hypertext documents. We employ web searching as an application to illustrate our results. Anecdotally, when applied to the documents returned by AltaVista in response to the query "abduction," our algorithm separates documents about "alien abduction" from those about "child abduction."
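As a rough illustration of how words, out-links, and in-links might be combined, the following minimal sketch scores document similarity as a weighted sum of cosine similarities over the three feature types. This is an assumption made for illustration, not the algorithm from the talk: the abstract does not specify the similarity measure, and here the weights are fixed by hand, whereas the proposed algorithm determines them automatically. The helper names (`cosine`, `weighted_similarity`, `make_doc`), the toy documents, and the `.example` link targets are all hypothetical.

```python
import math
from collections import Counter

FEATURES = ("words", "out_links", "in_links")


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_similarity(doc_a, doc_b, weights):
    """Weighted sum of per-feature cosine similarities.

    Assumption: the weights are supplied by the caller and sum to 1.
    """
    return sum(weights[f] * cosine(doc_a[f], doc_b[f]) for f in FEATURES)


def make_doc(words, out_links, in_links):
    """Build a hypothetical document record with one count vector per feature."""
    return {
        "words": Counter(words),
        "out_links": Counter(out_links),
        "in_links": Counter(in_links),
    }


# Made-up documents that all match the query "abduction".
alien_1 = make_doc(["abduction", "alien", "ufo"],
                   ["ufo.example"], ["sightings.example"])
alien_2 = make_doc(["abduction", "alien", "spacecraft"],
                   ["ufo.example"], ["sightings.example"])
child_1 = make_doc(["abduction", "child", "missing"],
                   ["police.example"], ["safety.example"])

# Hand-picked weights, for illustration only.
weights = {"words": 0.5, "out_links": 0.25, "in_links": 0.25}

same_topic = weighted_similarity(alien_1, alien_2, weights)
cross_topic = weighted_similarity(alien_1, child_1, weights)
print(same_topic > cross_topic)
```

In this toy setup all three documents share the query word, so word overlap alone separates them only weakly; the shared out-links and in-links are what pull the two "alien abduction" pages together. A clustering procedure built on such a score could then group documents by similarity, with the weights tuned per collection rather than fixed.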

This is joint work with W. Scott Spangler and will appear in the ACM Hypertext 2000 Conference.

 

Material from IMA talk: slides (pdf, postscript); paper (pdf, postscript)


Back to IMA "HOT TOPICS" Workshop: Text Mining
