Talk abstract:
Parallel Data Mining Algorithms
Vipin Kumar, University of Minnesota
During the last decade, we have seen explosive growth in
database technology and in the amount of data we collect. Advances
in data collection, the use of bar codes in commercial outlets, and
the computerization of business transactions have flooded us
with data and generated an urgent need to analyze this
data to extract more intelligent and useful information. Data
mining is the efficient and possibly unsupervised discovery
of interesting, useful and previously unknown patterns from
this data. Common patterns of interest include classification,
associations, clustering and sequential patterns. In this talk,
we will present parallel algorithms to discover classification
trees and association rules.
We present parallel formulations of a classification-rule-learning
algorithm based on induction. We will present two basic parallel
formulations: one is based on the Synchronous Tree Construction Approach
and the other on the Partitioned Tree Construction
Approach. We discuss the advantages and disadvantages of
these methods and propose a hybrid method that combines the good
features of both. We will also talk about how to handle
continuous attributes efficiently for this task.
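As a rough illustration of the synchronous idea, the sketch below assumes that each processor builds class-count histograms over its own share of the training records, that the histograms are summed across processors, and that every processor then picks the same split from the merged counts. The abstract does not spell out these details, so the function names, the entropy-based split criterion, and the toy records are all assumptions made for illustration; processors are simulated as plain Python lists.

```python
from collections import Counter
from math import log2

def local_histogram(rows, attr):
    """Class counts per attribute value on one (simulated) processor's rows."""
    hist = Counter()
    for row in rows:
        hist[(row[attr], row["label"])] += 1
    return hist

def entropy(class_counts):
    """Shannon entropy of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c)

def split_entropy(hist):
    """Weighted entropy of the children produced by splitting on one attribute."""
    by_value = {}
    for (value, label), n in hist.items():
        by_value.setdefault(value, Counter())[label] += n
    total = sum(hist.values())
    return sum(sum(c.values()) / total * entropy(c.values())
               for c in by_value.values())

def choose_split(partitions, attrs):
    """Pick the attribute with the lowest weighted entropy after merging counts."""
    best = None
    for attr in attrs:
        merged = Counter()
        for rows in partitions:
            # Stand-in for a global sum reduction of the local histograms.
            merged.update(local_histogram(rows, attr))
        score = split_entropy(merged)
        if best is None or score < best[1]:
            best = (attr, score)
    return best[0]

# Hypothetical toy records split across two simulated processors.
partitions = [
    [{"outlook": "sunny", "windy": "no", "label": "play"},
     {"outlook": "rain", "windy": "no", "label": "stay"}],
    [{"outlook": "sunny", "windy": "yes", "label": "play"},
     {"outlook": "rain", "windy": "yes", "label": "stay"}],
]
chosen = choose_split(partitions, ["windy", "outlook"])
```

Only the histograms cross processor boundaries in this sketch, never the records themselves, which suggests why the communication cost of such a scheme would grow with the number of tree nodes being expanded.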
We also discuss two parallel formulations, the count distribution
method and the data distribution method, for the computation
of association rules. The count distribution method scales with
data size, but does not scale with main-memory usage. The data
distribution method is supposed to scale with data size and
main memory, but suffers from high communication overhead and
duplicated work. We will present a new technique that improves
on the data distribution method. This method scales
with data size and main memory, and it incurs neither high communication
overhead nor duplicated work.
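The count distribution method described above can be sketched as follows, under the assumption that each processor counts every candidate itemset against its own share of the transactions and that the counts are then summed globally. Processors are simulated as Python lists, the global sum stands in for a real reduction step, and all names and the toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, candidates):
    """Count how many local transactions contain each candidate itemset."""
    counts = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if c <= items:
                counts[c] += 1
    return counts

def count_distribution(partitions, candidates, min_support):
    # Each simulated processor scans only its own partition but keeps a
    # counter for every candidate: the full candidate set is replicated.
    per_processor = [local_counts(p, candidates) for p in partitions]
    # Stand-in for a global sum reduction across processors.
    total = Counter()
    for counts in per_processor:
        total.update(counts)
    return {c: n for c, n in total.items() if n >= min_support}

# Hypothetical transactions split across two simulated processors.
partitions = [
    [("bread", "milk"), ("bread", "beer", "milk")],
    [("milk", "beer"), ("bread", "milk", "eggs")],
]
items = sorted({i for p in partitions for t in p for i in t})
candidates = [frozenset(pair) for pair in combinations(items, 2)]
frequent = count_distribution(partitions, candidates, min_support=2)
```

Because every processor holds a counter for the whole candidate set, adding processors shrinks the data each one scans but not the memory each one needs, which matches the scaling behavior noted above.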
This is joint work with E. Han, G. Karypis, A. Srivastava
and V. Singh.
1996-1997
Mathematics in High Performance Computing