This is joint work with Yuqiang Guan, Subramanyam Mallela and Dharmendra Modha.
Usama Fayyad, Ph.D. (President and CEO, digiMine, Inc.) Usama@digiMine.com
The Business Evolution and Challenges of Data Mining
Data Mining has received much attention as companies and organizations started to ask how they can better utilize the huge data stores they built up over the past few decades. While some interesting progress has been achieved over the past few years, especially when it comes to techniques and scalable algorithms, very few organizations have managed to benefit from the technology. This paradoxical situation of having too much data yet not be able to utilize it or mine it arose because of both technical and business challenges. We will cover these challenges, paint a picture for where the data problems are, and then introduce data mining and its role.
Rather than focusing on the algorithmic aspects of data mining: scaling to large databases, aspects and challenges for fitting data mining with database systems, we will cover business challenges of how to make the technology really work. In this coverage, we will span the spectrum from technical to business issues. We shall also cover applications in the business setting to illustrate the challenges and contributions of data mining in eBusinesses as an example. We conclude by revisiting the technical challenges facing the field.
Sudipto Guha (Department of Computer and Information Science, University of Pennsylvania) firstname.lastname@example.org
Approximate Histogram Construction: Offline and Data Stream Algorithms Slides: pdf
Obtaining fast and accurate approximations to data distributions is a problem of central interest to database management. A variety of popular database applications including, approximate querying, similarity searching and data mining in most application domains, rely on such accurate approximations. Histogram based approximation is a very popular method in database theory and practice to succinctly represent a data distribution in a space efficient manner.
In this talk, we will consider various variants of optimal histogram construction. We will also investigate histogram construction for data streams.
Transductive Learning via Spectral Graph Partitioning Slides: pdf
Consider the following type of prediction problem: you have a large database of object (e.g. text documents) that you would like to organize according to some classification. You have the classification for a small subset of objects and you would like to infer the classes for the remaining objects. A typical example is relevance feedback in information retrieval: a user specifies some documents as relevant/non-relevant and wants to find all other relevant documents in the collection. This type of machine learning task is called transductive inference. A key difference to the typical inductive machine learning problem is that the location of the test points can be observed by the learning algorithm. I will show that exploiting cluster structure in the test set can help make more accurate predictions, in particular for text classification.
However, designing transductive learning algorithms that simultaneously minimize training error and maximize a measure of clustering in the test set is challenging. In this talk, I will present an approach based on spectral graph partitioning. In this approach, objects in the database form the nodes of a graph, while the edges represent dependencies. For such graphs, I will generalize ratio-cuts to find a clustering of the database that obeys the known classifications for the training examples. The relaxation of the resulting optimization problem can be solved as an eigenvalue problem. Empirical results show improved prediction accuracy especially for small training sets. Furthermore, the framework is very flexible, making it possible to implement co-training as a special case.
Michael I. Jordan (Department of Electrical Engineering and Computer Science, Department of Statistics, University of California, Berkeley) email@example.com
A pervasive problem in data analysis is that of integrating data from multiple, heterogeneous sources. Data sources may have varying formats, semantics and degrees of reliability. I illustrate this problem in two areas: one involving the annotation of images, and the other involving the prediction of the cellular location of proteins. In the case of image annotation, I describe a methodology for solving data integration problems that is based on probabilistic graphical models, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I show how convex relaxations can be used to solve the probabilistic inference problem that is at the core of the graphical model approach. In the case of protein annotation, I discuss an approach based on "kernel methods," an area of machine learning that also makes significant use of convex optimization techniques. I show how multiple kernels can be combined, yielding a problem that is still within the scope of convex optimization.
(with David Blei, Nello Cristianini, Gert Lanckriet, William Noble and Martin Wainwright)
Adam Kalai (Applied Mathematics, Massachusetts Institute of Technology) firstname.lastname@example.org
In an online optimization problem, one makes a series of decisions without knowledge of future data. As an example, a factory may have a fixed feasible set of production options. Each month, the factory must decide how many of each type of item to produce, and only after production do they find out the profits of each item. A second example would be: each day you have to drive from home to work, and only afterwards do you find out the times on each of the roads. The goals are to maximize total profits, in the first case, and minimize total time, in the second.
A standard measure of regret is relative to the best decision one would have made in hindsight, i.e. compared to the best single choice to use if you had to use the same choice every day. For a wide variety of online optimization problems, it is known that one can guarantee low expected regret relative to the best decision in hindsight. We demonstrate this and show that one can also compute these online decisions as efficiently as one can compute the best decision in hindsight.
Ravi Kannan (Department of Computer Science, Yale University) email@example.com
An approach to problems where the data is too large to be stored in Random Access Memory let alone be analyzed in full is Sampling. The simplest kind of sampling, namely Uniform Sampling can be implemented in one pass through the data. Recently, algorithms for solving approximately a host of graph and combinatorial problems which use a surprisingly small uniformly drawn sample have been developed.
There has also been progress on using non-uniform sampling in the "streaming'' model, which allows just one pass through the data. Clustering and frequency estimation problems have been thus tackled.
We will also consider problems, where, the data consists of a large matrix which can be stored in external memory and read a limited number of times. We will present algorithms to find a succinct synopsis of the matrix, storable in RAM, from which, later, we may answer Similarity (arising in Information Retrieval) and other queries fast.
Navigating Nets: Simple Algorithms for Proximity Search
Nearest-neighbor search is the following problem: Given a set S of n points in a metric space X, preprocess S so as to efficiently find the point in S that is closest to a query point q X. Recently, there has been a significant interest in the case of general metrics, because in many practical applications the underlying metric is not Euclidean (or another lp-metric).
We devise for this problem a simple and efficient dynamic data structure (i.e., supports insertions and deletions to S) that answers 1+ approximate nearest neighbor queries for any > 0. Our algorithms are deterministic and provably correct with no assumptions on the metric. Their (time and space) complexity depends on the dimensionality of the data set S, which we measure using a very natural notion borrowed from functional analysis. For example, if S has constant dimension, the above operations take logarithmic time (for fixed > 0). We further show that the dependency of our bounds on the dimension is near-optimal in a certain oracle model. Finally, our data structure outperforms the metric skip list of Karger and Ruhl [STOC, 2002] by being deterministic, dimension oblivious, and conceptually simpler.
[Joint work with James R. Lee.]
Using vector-based representations of document collections has enabled the application of powerful dimension-reduction techniques to information retrieval, document clustering, and other text analysis tasks. One of the most prominent of these techniques is Latent Semantic Indexing (LSI). However, despite ample empirical experience with it, there is still little understanding of when LSI can -- and, just as importantly, cannot -- be expected to perform well. This talk consists of two parts. First, after a self-contained introduction to LSI, we provide a novel formal analysis of it that links its performance in a precise way to the uniformity of the underlying topic-document distribution. Second, we present a new algorithm, Iterative Residual Rescaling (IRR), that corrects for skewed distributions by automatically adjusting to non-uniformity in the topic distribution without knowledge of the underlying topics. A series of experiments over a variety of evaluation metrics validates our analysis and demonstrates the effectiveness of IRR. Joint work with Rie Kubota
Joint work with Rie Kubota Ando.
Conductance and Spectra of Power Law and Scale Free Graphs
We generalize the classical result of expansion of random regular graphs for graphs whose degree sequences follow heavy tailed statistics. Such graphs arise in complex networks like the Internet, the WWW and biological networks. In particular, we establish conductance Omega(1/logn) in the configurational random graph model and conductance Omega(1) the evolutionary model of growth with preferential attachment, under the assumption that the minimum degree is at least 3 in the former and at least 2 in the latter. A structural implication of our results is separation of the 1st and 2nd eigenvalues of the stochastic normalization of the adjacency matrix of the graph. The algorithmic implications of our results include low congestion routing on the Internet and fast crawling on the WWW. We give complementary results for the spectrum of the non-normalized adjacency matrix of the graph, with implications in information retrieval.
Joint work with C. Gkantsidis, C. Papadimitriou and A. Saberi.
Sridhar Rajagopalan (IBM Almaden Research Center) firstname.lastname@example.org
In this talk, I will introduce two practical models for large data set analysis. The first is derived from the streaming computational model by introducing a sort primitive. The second is a more formal model, streaming networks. We will discuss three issues in the new context. (1) Show that the two models, though apparently different, are essentially the same. (2) Describe efficient solutions to several combinatorial optimization problems. (3) Describe our experiences in implementing these models on modern commodity hardware.
I will then argue that because the model is "stable" from a complexity theory point of view, amenable to efficient and cost effective implementation, and general enough to encompass a wide range of problems, it can be the basis for a computational platform for large scale data analysis.
Although the study of market equilibria has occupied center stage within Mathematical Economics for over a century, polynomial time algorithms for such questions have so far evaded researchers. It has been customary to relegate such efficiency issues to the "Invisible Hand'' of the market, a notion propounded by Adam Smith (Wealth of Nations, 1776). In the context of Algorithmic Game Theory, a nascent area attempting to address new issues arising from the Internet, we provide the first polynomial time algorithm for the linear version of a problem defined by Irving Fisher in 1891.
Although the linear case provides an excellent starting point, it is not general enough to capture some of the more attractive aspects of pricing mechanisms. To study these, we define a new model which takes into consideration spending constraints among buyers. Our algorithms are inspired by the primal-dual schema from combinatorial optimization.
(First result joint with Devanur, Papadimitriou, and Saberi, and available at http://www.cc.gatech.edu/fac/Vijay.Vazirani)
Santosh Vempala (Department of Mathematics, Massachusetts Institute of Technology) email@example.com
A mixture model is a weighted combination of probability distributions. We consider the problem of classifying a sample according to the underlying component distributions and focus on a simple spectral algorithm. The success of the algorithm depends on a geometric separation condition between the means of the component distributions. For a mixture of $k$ spherical Gaussians, the algorithm achieves asymptotically optimal bounds. We will also discuss its performance for general mixtures.
Joint work with R. Kannan and G. Wang.
Information Extraction: A Dimension Reduction Technique Based on Information Theory (poster session)
Modern graphical tools have enhanced our ability to learn many things from data directly, but the issue is what to plot with a high-dimensional data set. Thus dimension reduction shows its importance. Based on information theory, here we develop a dimension reduction technique for extracting important information. Our method in general could be applied in any area relating to information extraction. It involves density estimation which can be estimated non-parametrically. And that is feasible with the help of fast computing techniques. We provide theoretical justification for connecting with central subspace. Illustrative examples are presented.