During the last decade, we have seen explosive growth in database technology and in the amount of data collected. Advances in data collection, the use of bar codes in commercial outlets, and the computerization of business transactions have flooded us with data and generated an urgent need to analyze it to extract useful information. Data mining is the efficient and possibly unsupervised discovery of interesting, useful, and previously unknown patterns from this data. Common patterns of interest include classifications, associations, clusters, and sequential patterns. In this talk, we will present parallel algorithms for discovering classification trees and association rules.
We present parallel formulations of a classification-rule learning algorithm based on induction. We will present two basic parallel formulations: one based on the Synchronous Tree Construction Approach and the other on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their strengths. We will also discuss how to handle continuous attributes efficiently for this task.
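To give a flavor of the synchronous approach, the sketch below shows its core step: every processor holds a horizontal partition of the training records, computes local class-count histograms for a candidate split attribute at the current tree node, and the histograms are then combined so all processors evaluate the same split on identical global statistics. This is a minimal illustration under our own assumptions, not the talk's actual implementation; threads stand in for processors, and all function names are ours.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_class_counts(partition, attr):
    # Per-processor step: tally (attribute value, class label) pairs
    # over this processor's horizontal slice of the training records.
    counts = Counter()
    for row in partition:
        counts[(row[attr], row["label"])] += 1
    return counts

def synchronous_split_counts(partitions, attr):
    # Synchronous approach: all processors work on the same tree node;
    # local histograms are gathered (the communication step) so each
    # processor can score candidate splits on global statistics.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        local = list(pool.map(lambda p: local_class_counts(p, attr), partitions))
    total = Counter()
    for counts in local:
        total.update(counts)
    return total
```

In the partitioned approach, by contrast, different processors would take ownership of different subtrees and work on them independently, trading this per-node communication for load-balancing concerns.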
We also discuss two parallel formulations for computing association rules: the count distribution method and the data distribution method. The count distribution method scales with data size but not with main-memory usage. The data distribution method is designed to scale with both data size and main memory, but it suffers from high communication overhead and duplicated work. We will present a new technique that improves on the data distribution method: it scales with both data size and main memory, yet avoids the high communication overhead and the duplicated work.
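The count distribution method can be sketched as follows: the full candidate itemset table is replicated on every processor (which is why the method is bounded by main memory), each processor counts candidate supports over only its own partition of the transactions, and a global sum-reduction yields the frequent itemsets. This is an illustrative sketch under our own assumptions, with sequential loops standing in for per-processor work; the function names are ours.

```python
from collections import Counter

def local_support(transactions, candidates):
    # Per-processor step: scan only this processor's partition of the
    # transactions, counting every candidate itemset. The candidate set
    # is replicated on all processors -- hence the main-memory limit.
    counts = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    return counts

def count_distribution(partitions, candidates, min_support):
    # Global sum-reduction of the local counts; an itemset is frequent
    # when its total support meets the threshold.
    total = Counter()
    for partition in partitions:
        total.update(local_support(partition, candidates))
    return {c for c in candidates if total[c] >= min_support}
```

The data distribution method instead partitions the candidates across processors so each holds only a fraction of them in memory, at the cost of shipping transaction data between processors.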
This is joint work with E. Han, G. Karypis, A. Srivastava, and V. Singh.