May 6-9, 2003

**Inderjit S. Dhillon** (Department of Computer Sciences, University of Texas at Austin) inderjit@cs.utexas.edu
http://www.cs.utexas.edu/users/inderjit/

**Information-Theoretic Clustering, Co-clustering and Matrix Approximations**
Papers: jmlrdist.pdf, jmlrdist.ps, kdd_cocluster.pdf, kdd_cocluster.ps
Slides: html, pdf, ps, ppt

Given
a two-dimensional co-occurrence matrix, we consider the problem
of co-clustering or simultaneous clustering of the rows and
columns. We view the (scaled) co-occurrence matrix as an empirical
joint probability distribution of two discrete random variables,
and pose the co-clustering problem as an optimization problem
in information theory --- the optimal co-clustering maximizes
the mutual information between the clustered random variables
subject to constraints on the number of row and column clusters.
It can be shown that this objective is identical to the objective
of finding a "low-parameter" non-negative matrix approximation
that is closest in relative entropy to the given matrix under
cluster constraints. Based on this formulation, we present
a co-clustering algorithm that monotonically increases the
preserved mutual information by intertwining both the row
and column clusterings at all stages. We apply the algorithm
to some problems in text mining: distributional clustering
of words applied to text classification, and simultaneous
clustering of words and documents. Experimental results indicate
that distributional word clustering leads to improved text
classification when the training data is small in size, and
co-clustering can overcome problems due to sparsity.
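The information-theoretic objective above can be made concrete in a few lines. The following is a minimal sketch, not the authors' full alternating co-clustering algorithm: it scales a co-occurrence matrix into an empirical joint distribution and measures how much mutual information a given row/column co-clustering preserves. The function names and the toy matrix are illustrative.

```python
import numpy as np

def mutual_information(p):
    """I(X;Y) for a joint distribution given as a 2-D array summing to 1."""
    px = p.sum(axis=1, keepdims=True)              # marginal of rows
    py = p.sum(axis=0, keepdims=True)              # marginal of columns
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

def clustered_joint(p, row_labels, col_labels, k, l):
    """Aggregate p into the k x l joint distribution of the clustered variables."""
    q = np.zeros((k, l))
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            q[r, c] += p[i, j]
    return q

# toy co-occurrence matrix scaled to an empirical joint distribution
counts = np.array([[4, 4, 0, 0],
                   [4, 4, 0, 0],
                   [0, 0, 4, 4],
                   [0, 0, 4, 4]], dtype=float)
p = counts / counts.sum()
q = clustered_joint(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
# mutual information lost by clustering: nonnegative in general,
# and zero here because the block structure is exact
loss = mutual_information(p) - mutual_information(q)
```

The optimal co-clustering is the one minimizing this loss, which (as the abstract notes) equals the relative entropy between the matrix and its cluster-constrained approximation.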

This is joint work with Yuqiang Guan, Subramanyam Mallela and Dharmendra Modha.

**Usama Fayyad**, Ph.D. (President and CEO, digiMine, Inc.) Usama@digiMine.com

**The Business Evolution and Challenges of Data Mining**

Data Mining has received much attention as companies and organizations have started to ask how they can better utilize the huge data stores they have built up over the past few decades. While some interesting progress has been achieved over the past few years, especially in techniques and scalable algorithms, very few organizations have managed to benefit from the technology. This paradoxical situation, having too much data yet not being able to utilize or mine it, arose because of both technical and business challenges. We will cover these challenges, paint a picture of where the data problems are, and then introduce data mining and its role.

Rather than focusing on the algorithmic aspects of data mining (scaling to large databases, the challenges of integrating data mining with database systems), we will cover the business challenge of making the technology really work. In this coverage, we span the spectrum from technical to business issues. We also cover applications in the business setting, using eBusiness as an example to illustrate the challenges and contributions of data mining. We conclude by revisiting the technical challenges facing the field.

Bio of Usama Fayyad:

Dr. Fayyad is co-founder, Chairman, and CEO of digiMine, Inc., a privately held company of over 100 employees focused on data mining and business intelligence solutions. As CEO, he raised $45 million in venture capital from top investors and has grown the company to serve many large Fortune 500 clients. Prior to digiMine, Dr. Fayyad founded and led Microsoft Research's Data Mining & Exploration (DMX) Group. At Microsoft he also led the development of data mining components within Microsoft products, including SQL Server 2000. From 1989 to 1995, Dr. Fayyad was at NASA's Jet Propulsion Laboratory (JPL), California Institute of Technology, where he founded and grew a multi-million dollar advanced research program to develop data mining systems for the analysis of large scientific databases. This work solved some significant scientific problems and earned him numerous awards, including the most distinguished excellence award from Caltech/JPL and a U.S. Government Medal from NASA. Dr. Fayyad has published over 100 technical articles and is co-editor of two books on Data Mining and Knowledge Discovery in Databases. He is very active in the KDD community, having served as program co-chair of KDD-94 and KDD-95 (the 1st International Conference on Knowledge Discovery and Data Mining) and as general chair of KDD-96 and KDD-99. He is Editor-in-Chief of the journal Data Mining and Knowledge Discovery, a Director of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), and Editor-in-Chief of its newsletter, SIGKDD Explorations. He serves on several editorial boards, including that of the Communications of the ACM. He holds a Ph.D. in Computer Science and Engineering from The University of Michigan, Ann Arbor (1991), in addition to an M.Sc. in Mathematics (1989), an M.S.E. in Computer Engineering (1986), and two B.S.E.'s in EE and CS (1984).

**Sudipto Guha** (Department of Computer and Information Science, University of Pennsylvania) sudipto@central.cis.upenn.edu

**Approximate Histogram Construction: Offline and Data Stream Algorithms**
Slides: pdf

Obtaining fast and accurate approximations to data distributions is a problem of central interest in database management. A variety of popular database applications, including approximate querying, similarity searching, and data mining, rely on such accurate approximations in most application domains. Histogram-based approximation is a very popular method, in database theory and practice, for representing a data distribution in a space-efficient manner.

In this talk, we will consider several variants of optimal histogram construction. We will also investigate histogram construction for data streams.
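As a point of reference for the offline variant, here is a hedged sketch of the classical V-optimal histogram dynamic program: split the data into a fixed number of contiguous buckets, approximate each bucket by its mean, and minimize total squared error. This is a standard textbook formulation, not necessarily the specific algorithm of the talk.

```python
import numpy as np

def v_optimal_histogram(data, num_buckets):
    """Exact V-optimal histogram via O(n^2 * B) dynamic programming."""
    n = len(data)
    pre = np.concatenate([[0.0], np.cumsum(data)])          # prefix sums
    pre2 = np.concatenate([[0.0], np.cumsum(np.square(data))])

    def sse(i, j):
        # squared error of bucket data[i:j] approximated by its mean
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    dp[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(1, n + 1):
            for i in range(b - 1, j):
                c = dp[b - 1][i] + sse(i, j)
                if c < dp[b][j]:
                    dp[b][j], cut[b][j] = c, i
    # recover bucket boundaries by walking the cut table backwards
    bounds, j = [], n
    for b in range(num_buckets, 0, -1):
        bounds.append((cut[b][j], j))
        j = cut[b][j]
    return dp[num_buckets][n], bounds[::-1]

err, buckets = v_optimal_histogram([1, 1, 1, 9, 9, 9], 2)
# err == 0.0, buckets == [(0, 3), (3, 6)]
```

The streaming and approximate variants discussed in the talk trade the quadratic table for sketches and near-optimal guarantees.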

**Thorsten Joachims** (Department of Computer Science, Cornell University) tj@cs.cornell.edu
http://www.cs.cornell.edu/People/tj/

**Transductive Learning via Spectral Graph Partitioning**
Slides: pdf

Consider the following type of prediction problem: you have a large database of objects (e.g., text documents) that you would like to organize according to some classification. You have the classification for a small subset of objects and you would like to infer the classes of the remaining objects. A typical example is relevance feedback in information retrieval: a user marks some documents as relevant/non-relevant and wants to find all other relevant documents in the collection. This type of machine learning task is called transductive inference. A key difference from the typical inductive machine learning problem is that the location of the test points can be observed by the learning algorithm. I will show that exploiting cluster structure in the test set can help make more accurate predictions, in particular for text classification.

However, designing transductive learning algorithms that simultaneously minimize training error and maximize a measure of clustering in the test set is challenging. In this talk, I will present an approach based on spectral graph partitioning. In this approach, objects in the database form the nodes of a graph, while the edges represent dependencies. For such graphs, I will generalize ratio-cuts to find a clustering of the database that obeys the known classifications for the training examples. The relaxation of the resulting optimization problem can be solved as an eigenvalue problem. Empirical results show improved prediction accuracy especially for small training sets. Furthermore, the framework is very flexible, making it possible to implement co-training as a special case.
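The spectral relaxation idea can be illustrated in its unconstrained form. Below is a small sketch of spectral bipartitioning via the Fiedler vector of the graph Laplacian, the standard relaxation of ratio-cut; the talk's method additionally constrains the cut to respect the known training labels, which this sketch omits.

```python
import numpy as np

def spectral_bipartition(adjacency):
    """Split a graph in two by the sign of the Fiedler vector (the
    eigenvector of the graph Laplacian with the second-smallest
    eigenvalue), a standard relaxation of the ratio-cut objective."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A          # unnormalized Laplacian D - A
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues ascending
    fiedler = eigvecs[:, 1]
    return (fiedler >= 0).astype(int)

# two triangles joined by a single bridge edge: nodes {0,1,2} vs {3,4,5}
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
labels = spectral_bipartition(A)
# the two triangles receive different labels
```

Adding label constraints turns this eigenvalue problem into the constrained one solved in the talk.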

**Michael I. Jordan** (Department of Electrical Engineering and Computer Science, Department of Statistics, University of California, Berkeley) jordan@cs.berkeley.edu

**Kernel Methods, Graphical Models and Convex Optimization**
Papers: WaiJor_Semidef03tech.pdf, WaiJor_Semidef03tech.ps, csd-02-1202.pdf, csd-02-1202.ps, csd-02-1206.pdf, csd-02-1206.ps

A pervasive problem in data analysis is that of integrating data from multiple, heterogeneous sources. Data sources may have varying formats, semantics and degrees of reliability. I illustrate this problem in two areas: one involving the annotation of images, and the other involving the prediction of the cellular location of proteins. In the case of image annotation, I describe a methodology for solving data integration problems that is based on probabilistic graphical models, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I show how convex relaxations can be used to solve the probabilistic inference problem that is at the core of the graphical model approach. In the case of protein annotation, I discuss an approach based on "kernel methods," an area of machine learning that also makes significant use of convex optimization techniques. I show how multiple kernels can be combined, yielding a problem that is still within the scope of convex optimization.

(with David Blei, Nello Cristianini, Gert Lanckriet, William Noble and Martin Wainwright)
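One ingredient mentioned above, combining multiple kernels, can be sketched simply: a convex combination of positive semidefinite Gram matrices is again positive semidefinite, hence a valid kernel merging heterogeneous sources. The talk's approach learns the combination weights by convex optimization; in this illustrative sketch they are fixed by hand.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """A convex combination of positive semidefinite Gram matrices is
    again positive semidefinite, so it is itself a valid kernel."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-12
    return sum(w * K for w, K in zip(weights, kernels))

x = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
K_lin = x @ x.T                              # linear kernel on one "source"
K_rbf = np.exp(-np.square(x - x.T))          # Gaussian kernel on another
K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])
# K remains PSD: all eigenvalues nonnegative up to numerical tolerance
```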

**Adam Kalai** (Applied Mathematics, Massachusetts Institute of Technology) akalai@mit.edu

**Algorithms for Online Optimization Problems**
Slides: pdf, ps
Paper: pdf, ps

In an online optimization problem, one makes a series of decisions without knowledge of future data. As an example, a factory may have a fixed feasible set of production options. Each month, the factory must decide how many of each type of item to produce, and only after production does it find out the profit on each item. A second example: each day you have to drive from home to work, and only afterwards do you find out the travel times on each of the roads. The goals are to maximize total profit in the first case and to minimize total time in the second.

A standard measure of regret is relative to the best decision one would have made in hindsight, i.e., compared to the best single choice had you been required to make the same choice every day. For a wide variety of online optimization problems, it is known that one can guarantee low expected regret relative to the best decision in hindsight. We demonstrate this and show that one can also compute these online decisions as efficiently as one can compute the best decision in hindsight.
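A minimal sketch of one such approach, follow-the-perturbed-leader: add a random perturbation to the cumulative costs and then play the decision an offline oracle would pick on the perturbed history. The brute-force oracle and the two-route toy decision set below are illustrative, and this one-shot perturbation is only one of the variants analyzed in this line of work.

```python
import random

def follow_the_perturbed_leader(decisions, cost_rounds, epsilon=0.5, seed=0):
    """Each round, play the decision minimizing cumulative past cost plus a
    one-time random perturbation; the 'offline oracle' here is brute force
    over a small explicit decision set."""
    rng = random.Random(seed)
    n = len(decisions[0])
    total = [rng.uniform(0, 1.0 / epsilon) for _ in range(n)]  # perturbed start
    online_cost = 0.0
    for costs in cost_rounds:
        # decision that is best on perturbed history, found by brute force
        best = min(decisions, key=lambda d: sum(di * t for di, t in zip(d, total)))
        online_cost += sum(di * c for di, c in zip(best, costs))
        for i in range(n):
            total[i] += costs[i]
    return online_cost

# two "routes" over two edges; each day's per-edge costs arrive after the choice
decisions = [(1, 0), (0, 1)]
rounds = [(1.0, 0.0)] * 5 + [(0.0, 1.0)] * 2
cost = follow_the_perturbed_leader(decisions, rounds)
```

The key efficiency point of the talk is that the only computation needed per round is one call to the offline optimizer, here the `min` over decisions.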

**Ravi Kannan** (Department of Computer Science, Yale University) kannan@cs.yale.edu

**Sampling on the Fly from Massive Data**
Slides: pdf, ps

Sampling is an approach to problems where the data is too large to be stored in random access memory, let alone analyzed in full. The simplest kind, uniform sampling, can be implemented in one pass through the data. Recently, algorithms have been developed that approximately solve a host of graph and combinatorial problems using a surprisingly small uniformly drawn sample.
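One-pass uniform sampling is commonly implemented by reservoir sampling, sketched below; this is a standard illustrative technique, not necessarily the specific algorithm discussed in the talk.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length in a single pass (classic reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)   # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
```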

There has also been progress on using non-uniform sampling in the "streaming" model, which allows just one pass through the data. Clustering and frequency estimation problems have thus been tackled.

We will also consider problems where the data consists of a large matrix that can be stored in external memory and read a limited number of times. We will present algorithms to find a succinct synopsis of the matrix, storable in RAM, from which we may later answer similarity queries (arising in information retrieval) and other queries fast.
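One concrete instance of such a synopsis, sketched under the assumption that columns can be sampled with probability proportional to their squared norms (length-squared sampling in the style of Frieze, Kannan, and Vempala): the rescaled sampled columns approximate the matrix's Gram structure.

```python
import numpy as np

def length_squared_column_sample(A, s, seed=0):
    """Sample s columns with probability proportional to their squared
    norms and rescale, so that C @ C.T approximates A @ A.T in
    expectation."""
    rng = np.random.default_rng(seed)
    col_norms = np.square(A).sum(axis=0)
    p = col_norms / col_norms.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)      # with replacement
    return A[:, idx] / np.sqrt(s * p[idx])

A = np.random.default_rng(1).standard_normal((20, 500))
C = length_squared_column_sample(A, 200)           # 20 x 200 synopsis
```

The synopsis C fits in RAM even when A lives in external memory, and queries against A A^T can be answered approximately from C.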

**Robert Krauthgamer** (University of California at Berkeley and International Computer Science Institute (ICSI)) robi@cs.berkeley.edu
http://www.cs.berkeley.edu/~robi/

**Navigating Nets: Simple Algorithms for Proximity Search**

Nearest-neighbor search is the following problem: given a set S of n points in a metric space X, preprocess S so as to efficiently find the point in S that is closest to a query point q ∈ X. Recently, there has been significant interest in the case of general metrics, because in many practical applications the underlying metric is not Euclidean (or another l_{p}-metric).

We devise for this problem a simple and efficient dynamic data structure (i.e., one supporting insertions into and deletions from S) that answers (1+ε)-approximate nearest neighbor queries for any ε > 0. Our algorithms are deterministic and provably correct, with no assumptions on the metric. Their (time and space) complexity depends on the dimensionality of the data set S, which we measure using a very natural notion borrowed from functional analysis. For example, if S has constant dimension, the above operations take logarithmic time (for fixed ε > 0). We further show that the dependency of our bounds on the dimension is near-optimal in a certain oracle model. Finally, our data structure outperforms the metric skip list of Karger and Ruhl [STOC, 2002] by being deterministic, dimension oblivious, and conceptually simpler.

[Joint work with James R. Lee.]

**Lillian Lee** (Department of Computer Science, Cornell University) llee@cs.cornell.edu
http://www.cs.cornell.edu/home/llee

**The Iterative Residual Rescaling algorithm: An analysis and generalization of Latent Semantic Indexing**
Paper: pdf, ps

Using vector-based representations of document collections has enabled the application of powerful dimension-reduction techniques to information retrieval, document clustering, and other text analysis tasks. One of the most prominent of these techniques is Latent Semantic Indexing (LSI). However, despite ample empirical experience with it, there is still little understanding of when LSI can -- and, just as importantly, cannot -- be expected to perform well. This talk consists of two parts. First, after a self-contained introduction to LSI, we provide a novel formal analysis that links its performance in a precise way to the uniformity of the underlying topic-document distribution. Second, we present a new algorithm, Iterative Residual Rescaling (IRR), that corrects for skewed distributions by automatically adjusting to non-uniformity in the topic distribution without knowledge of the underlying topics. A series of experiments over a variety of evaluation metrics validates our analysis and demonstrates the effectiveness of IRR.

Joint work with Rie Kubota Ando.
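For context, the LSI step being analyzed is just a rank-k truncated SVD of the term-document matrix; here is a minimal sketch on a toy matrix (IRR itself, which rescales residuals iteratively, is not reproduced here).

```python
import numpy as np

def lsi(term_doc, k):
    """Latent Semantic Indexing: rank-k truncated SVD of the
    term-document matrix; each document maps to a point in the
    k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(np.asarray(term_doc, dtype=float),
                             full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T     # one row per document

# toy term-document matrix with two "topics" that share no terms
X = np.array([[2, 3, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 2, 2]], dtype=float)
docs = lsi(X, 2)
# documents 0,1 and 2,3 occupy orthogonal directions of the latent space
```

The analysis in the talk concerns what happens when, unlike in this balanced toy, one topic dominates the collection.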

**Milena Mihail** (College of Computing, Georgia Institute of Technology) mihail@cc.gatech.edu
http://www.cc.gatech.edu/fac/Milena.Mihail/

**Conductance and Spectra of Power Law and Scale Free Graphs**

We generalize the classical expansion results for random regular graphs to graphs whose degree sequences follow heavy-tailed statistics. Such graphs arise in complex networks like the Internet, the WWW, and biological networks. In particular, we establish conductance Omega(1/log n) in the configurational random graph model and conductance Omega(1) in the evolutionary model of growth with preferential attachment, under the assumption that the minimum degree is at least 3 in the former and at least 2 in the latter. A structural implication of our results is separation of the 1st and 2nd eigenvalues of the stochastic normalization of the adjacency matrix of the graph. The algorithmic implications of our results include low-congestion routing on the Internet and fast crawling on the WWW. We give complementary results for the spectrum of the non-normalized adjacency matrix of the graph, with implications in information retrieval.

Joint work with C. Gkantsidis, C. Papadimitriou and A. Saberi.
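The eigenvalue separation referred to above can be computed directly for a small graph. A sketch of the spectral gap of the stochastic normalization D^{-1}A, computed via the similar symmetric matrix D^{-1/2} A D^{-1/2}; the complete graph K5 is an illustrative well-connected example, not one of the paper's power-law graphs.

```python
import numpy as np

def spectral_gap(adjacency):
    """Gap between the 1st and 2nd eigenvalues of the stochastic
    normalization D^{-1} A of the adjacency matrix, computed via the
    similar symmetric matrix D^{-1/2} A D^{-1/2}."""
    A = np.asarray(adjacency, dtype=float)
    d = A.sum(axis=1)
    Dinvsqrt = np.diag(1.0 / np.sqrt(d))
    M = Dinvsqrt @ A @ Dinvsqrt            # same spectrum as D^{-1} A
    eigs = np.sort(np.linalg.eigvalsh(M))[::-1]
    return eigs[0] - eigs[1]               # top eigenvalue is always 1

# complete graph K5: very well connected, hence a large gap
K5 = np.ones((5, 5)) - np.eye(5)
gap = spectral_gap(K5)                     # eigenvalues 1 and -1/4, gap 1.25
```

By Cheeger-type inequalities, the conductance bounds of the talk translate into lower bounds on exactly this gap.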

**Sridhar Rajagopalan** (IBM Almaden Research Center) sridhar@almaden.ibm.com

**Practical Models for Large Data Set Analysis**
Slides: html, pdf, ps, ppt

In this talk, I will introduce two practical models for large data set analysis. The first is derived from the streaming computational model by introducing a sort primitive. The second is a more formal model, streaming networks. We will discuss three issues in the new context: (1) show that the two models, though apparently different, are essentially the same; (2) describe efficient solutions to several combinatorial optimization problems; and (3) describe our experiences in implementing these models on modern commodity hardware.

I will then argue that because the model is "stable" from a complexity theory point of view, amenable to efficient and cost effective implementation, and general enough to encompass a wide range of problems, it can be the basis for a computational platform for large scale data analysis.

**Vijay V. Vazirani** (College of Computing, Georgia Institute of Technology) vazirani@cc.gatech.edu
http://www.cc.gatech.edu/fac/Vijay.Vazirani/

**How Intractable is the "Invisible Hand": Polynomial Time Algorithms for Market Equilibria**
Papers: market.pdf, market.ps, EC3.pdf, EC3.ps
Slides: html, pdf, ps, ppt

Although the study of market equilibria has occupied center stage within Mathematical Economics for over a century, polynomial time algorithms for such questions have so far evaded researchers. It has been customary to relegate such efficiency issues to the "Invisible Hand" of the market, a notion propounded by Adam Smith (Wealth of Nations, 1776). In the context of Algorithmic Game Theory, a nascent area attempting to address new issues arising from the Internet, we provide the first polynomial time algorithm for the linear version of a problem defined by Irving Fisher in 1891.

Although the linear case provides an excellent starting point, it is not general enough to capture some of the more attractive aspects of pricing mechanisms. To study these, we define a new model which takes into consideration spending constraints among buyers. Our algorithms are inspired by the primal-dual schema from combinatorial optimization.

(First result joint with Devanur, Papadimitriou, and Saberi, and available at http://www.cc.gatech.edu/fac/Vijay.Vazirani)

**Santosh Vempala** (Department of Mathematics, Massachusetts Institute of Technology) vempala@math.mit.edu

**On the Spectral Method for Clustering Mixture Models**
Slides: html
Papers: mixtures.pdf, mixtures.ps, specfocs.pdf, specfocs.ps

A mixture model is a weighted combination of probability distributions. We consider the problem of classifying a sample according to the underlying component distributions and focus on a simple spectral algorithm. The success of the algorithm depends on a geometric separation condition between the means of the component distributions. For a mixture of $k$ spherical Gaussians, the algorithm achieves asymptotically optimal bounds. We will also discuss its performance for general mixtures.

Joint work with R. Kannan and G. Wang.
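A minimal sketch of the spectral step: project the sample onto its top-k singular directions, after which well-separated components can be split by simple thresholding or distance-based clustering in the low-dimensional space. The two-Gaussian toy mixture below is illustrative, not the general setting of the talk.

```python
import numpy as np

def spectral_project(samples, k):
    """Project the sample matrix onto its top-k right singular directions
    (computed after centering); clustering is then done in this subspace."""
    X = np.asarray(samples, dtype=float)
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return X @ Vt[:k].T

rng = np.random.default_rng(0)
d, n = 50, 200
means = np.zeros((2, d))
means[0, 0], means[1, 0] = 10.0, -10.0     # well-separated component means
labels = rng.integers(0, 2, size=n)
X = means[labels] + rng.standard_normal((n, d))   # spherical Gaussian noise
Y = spectral_project(X, 2)
# in the projected space the two components separate along one coordinate
```

The separation condition in the talk quantifies exactly how far apart the component means must be for this projection to preserve the cluster structure.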

**Xiangrong Yin** (Department of Statistics, University of Georgia) xryin@stat.uga.edu
http://www.stat.uga.edu/~xryin/

**Information Extraction: A Dimension Reduction Technique Based on Information Theory** (poster session)

Modern graphical tools have enhanced our ability to learn many things from data directly, but the issue is what to plot with a high-dimensional data set. Thus dimension reduction shows its importance. Based on information theory, here we develop a dimension reduction technique for extracting important information. Our method in general could be applied in any area relating to information extraction. It involves density estimation which can be estimated non-parametrically. And that is feasible with the help of fast computing techniques. We provide theoretical justification for connecting with central subspace. Illustrative examples are presented.