Document Clustering: Is Hierarchical Clustering Really Better?

Monday, April 17, 2000 - 2:45pm - 3:10pm
Keller 3-180
Michael Steinbach (University of Minnesota, Twin Cities)
In this presentation we compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means has a time complexity which is linear in the number of documents, but is thought to produce inferior clusters. Sometimes the two approaches are combined so as to get the best of both worlds. This paper presents research that disputes the view that hierarchical clustering produces better quality document clusters than K-means or its variants. In particular, we give results which show that a simple and efficient bisecting K-means technique is as good or better than the pure or hybrid hierarchical approaches we tested. While our results are limited in scope, we believe that researchers interested in document clustering will find them quite interesting.