Main navigation | Main content

HOME » SCIENTIFIC RESOURCES » Volumes

Abstracts and Talk Materials

March 26, 2012

Capturing complex interactions among a large set of variables is a
challenging task. Probabilistic graphical models decouple these interactions into two parts, viz., structural or qualitative relationships represented by a graph, and parametric or quantitative relationships represented by values assigned to different groups of nodes. Graph estimation is an important task since it reveals useful relationships between the variables, but is challenging in high dimensions for loopy graphical models. Another important aspect that needs to be incorporated in modeling is the presence of latent or hidden variables.
In this talk, I will present models and methods for graph estimation in loopy graphical models with latent variables. I will give an overview of models and regimes, where graph estimation is tractable as well as present the strong theoretical guarantees proven for our proposed method. I will then present experimental outcome on newsgroup dataset, where it is seen that latent variables, corresponding to topics, are efficiently discovered, and incorporating loops in the model relaxes the requirement of a strict hierarchy between topics and words.

Completion or imputation of three-way data arrays with missing entries
is a basic problem encountered in various areas, including bioinformatics,
image processing, and preference analysis. If available,
prior information about the data at hand should be incorporated to
enhance performance of the imputation method adopted. This is
the motivation behind the proposed low-rank tensor estimator which
leverages the correlation across slices of the data cube in the form
of reproducing kernels. The rank of the tensor estimate is controlled
by a novel regularization on the factors of its PARAFAC decomposition.
Such a regularization is inspired by a reformulation of the
nuclear norm for matrices, which allows to bypass the challenge that
rank and singular values of tensors are unrelated quantities. The proposed
technique is tested on MRI data of the brain with 30% missing
data, resulting in a recovery error of −17dB.

March 28, 2012

We describe a simple and efficient method - the « Louvain method » - for the detection of communities in networks. The method has sub-quadratic computational complexity and can be routinely used for networks with billions of nodes or links ; the method was chosen a couple of months ago by LinkedIn for its network visualization tool. As an illustration of the method, we describe the communities obtained on a social network constructed from billions of mobile phone communications at country scale. The low complexity of the method allows the analysis of large time-evolving networks and we illustrate this with a vizualisation of how a country-wide social network evolve over a six months period.

In this talk we revisit (high dimensional) sparse regression -- the topic of much recent study and attention. Unlike the standard setting where covariates (the sensing matrix entries) are assumed to be known perfectly and the output (the response variables) free of persistent errors/corruption, real world applications may be plagued by both. Meanwhile, it is well known that while computationally efficient and statistically consistent in the standard setting, the presence of even a single corrupted point is enough to destroy the performance of regression algorithms like Lasso, Orthogonal Matching Pursuit, and others.

We consider support recovery in the case of arbitrary corruptions in both the response variables and the covariates, and ask how many such corruptions we can tolerate, compared to the total number of observations, while still allowing support recovery. We show that this is significantly more challenging than corruption only in the response variable. Indeed, to the best of our knowledge, there are no techniques that provide guarantees this setting; in particular, EM based algorithms, as well as Lasso, OMP and the recently analyzed Justice Pursuit, fail. We provide a robust regression algorithm that is simple, tractable, and can correct an increasing number of such arbitrarily corrupted points. Perhaps surprisingly, we show that the natural (and exponential time) brute force algorithm that minimizes regression error over all possible subsets of points and support, performs significantly worse than our efficient algorithm, in terms of support recovery.

En route, we discuss the challenges of robust statistics in high dimensions, the reason classical approaches fail in this regime, and, perhaps interestingly, suggest why robust regression is fundamentally more difficult than robust PCA.

We consider support recovery in the case of arbitrary corruptions in both the response variables and the covariates, and ask how many such corruptions we can tolerate, compared to the total number of observations, while still allowing support recovery. We show that this is significantly more challenging than corruption only in the response variable. Indeed, to the best of our knowledge, there are no techniques that provide guarantees this setting; in particular, EM based algorithms, as well as Lasso, OMP and the recently analyzed Justice Pursuit, fail. We provide a robust regression algorithm that is simple, tractable, and can correct an increasing number of such arbitrarily corrupted points. Perhaps surprisingly, we show that the natural (and exponential time) brute force algorithm that minimizes regression error over all possible subsets of points and support, performs significantly worse than our efficient algorithm, in terms of support recovery.

En route, we discuss the challenges of robust statistics in high dimensions, the reason classical approaches fail in this regime, and, perhaps interestingly, suggest why robust regression is fundamentally more difficult than robust PCA.

The Arab Spring was a period of social change through a series of protests and revolutions set against a backdrop of revolutions. During this time, the actors and issues were changing. This talk describes how dynamic network analysis, combining network based text mining with network analysis support the assessment of this change. News data on 18 countries for a period of 1 year are assessed and changes in the key actors, secondary actors, and issues identified. The structural analysis shows a stationarity in the incumbent power structure in contrast to volatility in the structure of the revolution.

Over the last ten years, a number of methodologies have been developed which leverage topological techniques and ways of thinking to provide understanding of point cloud data. These include ways of "measuring" shape via homological signatures, topological mapping techniques, and applications of certain kinds of diagram constructions to collections of samples. I will survey these methods, and propose some direct connections of these ideas with mainstream machine learning.

Deducing the state or structure of a system from partial, noisy measurements is a fundamental task throughout the sciences and engineering. The resulting inverse problems are often ill-posed because there are fewer measurements available than the ambient dimension of the model to be estimated. In practice, however, many interesting signals or models contain few degrees of freedom relative to their ambient dimension: a small number of genes may constitute the signature of a disease, very few parameters may specify the correlation structure of a time series, or a sparse collection of geometric constraints may determine a molecular configuration. Discovering, leveraging, or recognizing such low-dimensional structure plays an important role in making inverse problems well-posed. Examples of structured models include previously studied cases such as sparse signals and low-rank matrices, as well as others such as low-rank tensors, binary vectors, orthogonal matrices, and matrices formed as the sum of a few permutations. Inspired by the success of the L1-norm and nuclear-norm heuristics, we propose a general convex relaxation framework to recover such simple structured models from partial information. We provide sharp estimates of the number of generic measurements required for exact and robust recovery in a variety of settings. These estimates are based on computing certain Gaussian statistics related to the underlying model geometry. Thus our work extends the catalog of objects and structures (beyond sparse vectors and low-rank matrices) that can be recovered from limited information via tractable convex programming.

Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.

Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.

We present a series of stochastic gradient methods for solving convex
regularized stochastic optimization problems. In contrast to most existing
algorithms with only one proximal mapping, our approaches adopt two
proximal mappings at each iteration and the final solution is directly
obtained from the second proximal mapping. With this double projection
technique, we can show a uniformly-optimal expected convergence rate for
the final solution rather than the averaged iterates. Therefore, when
applied to sparse learning problems, our algorithms have the advantage of
naturally generating sparse solutions. Furthermore, we establish an
improved convergence rate for solving strongly convex problems. We also
establish the variance for the error of the objective value and high
probability bounds.

This work examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $nlog_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. {Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $R^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $dlog n$ adaptively selected pairwise comparisons, on average.} If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.

December 31, 1969

The discovery of knowledge in large data sets can often be formulated as a problem in
nonlinear function approximation. The inherent challenge in such an approach is that the data is often
high dimensional, scattered, not stationary and sparse. Ideally, a good model will also be able to
adapt and remain valid over extended regions in space and time. Our state of the art multivariate
black-box algorithm that automatically fits multi-scale nonlinear functions over high dimensional
domains which does not require ad-hoc parameters is presented. The performance on prediction of
financial time series and hurricane intensity is demonstrated. skew radial basis functions to achieve
more accuracy and lower model orders in empirical modeling when there are sharp transitions in the
data is presented. The novel time varying RBFs and techniques for parsimonious modeling of spatiotemporal
dynamics over high dimensional domains are presented, in the context of equation-free
frame work. The fundamental role of non-linear function approximation techniques for value/policy
function approximation, in the context of approximate dynamic programming is highlighted. We
consider stochastic energy resource allocations as our application for this study.

December 31, 1969

Statistical methods which deal with feature selection and model fitting simultaneously (such as the LASSO) have been studied extensively for linear models. However in many applications, the relationship between predictors and the response are nonlinear and there may be interaction effects between predictors.

We present sparse functional ANOVA models for modeling the relationship between predictors and the response. A sparse ANOVA model decomposes the predictive function into main effects and interactions. Our goal is to select a few predictors and select a few main effects and/or interactions. Feature selection, function component selection and parameter estimation are performed simultaneously with a boosting procedure. We compare our Sparse ANOVA method with the MARS and the COSSO in simulations and real examples. The Sparse ANOVA gives very competitive performances in these studies, especially for high-dimensional data.

We present sparse functional ANOVA models for modeling the relationship between predictors and the response. A sparse ANOVA model decomposes the predictive function into main effects and interactions. Our goal is to select a few predictors and select a few main effects and/or interactions. Feature selection, function component selection and parameter estimation are performed simultaneously with a boosting procedure. We compare our Sparse ANOVA method with the MARS and the COSSO in simulations and real examples. The Sparse ANOVA gives very competitive performances in these studies, especially for high-dimensional data.

We address the problem of learning in an online setting where the
learner repeatedly observes features, selects among a set of actions,
and receives reward for the action taken. We provide the first
efficient algorithm with an optimal regret. Our algorithm uses a cost
sensitive classification learner as an oracle and has a running time
polylog(N), where N is the number of classification rules among
which the oracle might choose. This is exponentially faster than all
previous algorithms that achieve optimal regret in this setting. Our
formulation also enables us to create an algorithm with regret that is
additive rather than multiplicative in feedback delay as in all
previous work.

We address the problem of learning in an online setting where the
learner repeatedly observes features, selects among a set of actions,
and receives reward for the action taken. We provide the first
efficient algorithm with an optimal regret. Our algorithm uses a cost
sensitive classification learner as an oracle and has a running time
polylog(N), where N is the number of classification rules among
which the oracle might choose. This is exponentially faster than all
previous algorithms that achieve optimal regret in this setting. Our
formulation also enables us to create an algorithm with regret that is
additive rather than multiplicative in feedback delay as in all
previous work.

Joint work with Miroslav Dudik, Daniel Hsu, Nikos Karampatziakis, John Langford, Lev Reyzin and Tong Zhang.

Joint work with Miroslav Dudik, Daniel Hsu, Nikos Karampatziakis, John Langford, Lev Reyzin and Tong Zhang.

Online social media represent a fundamental shift of how information
is being produced, transferred and consumed. User generated content in
the form of blog posts, comments, and tweets establishes a connection
between the producers and the consumers of information. Tracking the
pulse of the social media outlets, enables companies to gain feedback
and insight in how to improve and market products better. For
consumers, the abundance of information and opinions from diverse
sources helps them tap into the wisdom of crowds, to aid in making
more informed decisions.

The talk investigates machine learning techniques for modeling social networks and social media. First part will discuss methods for extracting and tracking information as it spreads among users. We will examine methods for extracting temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity and attention given to particular piece of content, and how to build predictive models of information diffusion and adoption. Second part will focus on models for extracting structure from social networks and predicting emergence of new links in social networks. In particular, we will examine methods based on Supervised Random Walks for learning to rank nodes on a graph and consequently recommend new friendships in social networks. We will also consider the problem of detecting dense overlapping clusters in networks and present efficient model based methods for network community detection.

The talk investigates machine learning techniques for modeling social networks and social media. First part will discuss methods for extracting and tracking information as it spreads among users. We will examine methods for extracting temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity and attention given to particular piece of content, and how to build predictive models of information diffusion and adoption. Second part will focus on models for extracting structure from social networks and predicting emergence of new links in social networks. In particular, we will examine methods based on Supervised Random Walks for learning to rank nodes on a graph and consequently recommend new friendships in social networks. We will also consider the problem of detecting dense overlapping clusters in networks and present efficient model based methods for network community detection.

Multivariate Gaussian data are completely characterized by their mean and
covariance but higher-order cumulants are unavoidable in non-Gaussian
data. For univariate data, these are well-studied via skewness and
kurtosis but for multivariate data, these cumulants are tensor-valued ---
higher-order analogs of the covariance matrix capturing higher-order
dependence in the data. We argue that multivariate cumulants may be
studied via their principal components, defined in a manner analogous to
the usual principal components of a covariance matrix. It is best viewed
as a subspace selection method that accounts for higher-order dependence
the way PCA obtains varimax subspaces. A variant of stochastic gradient
descent on the Grassmannian permits us to estimate principal components of
cumulants of any order in excess of 10,000 dimensions readily on a laptop
computer. This is joint work with Jason Morton.

December 31, 1969

Using Hyperspectral Imaging (HSI) with its capacity of sampling the electromagnetic spectrum using hundreds of narrow and contiguous spectral bands is a valuable remote sensing technology for many defense and security applications. Noise introduced by the Hyperspectral sensor and the interfering medium, in addition to the natural spectral variability of the materials, make information extraction more difficult. Pre-processing methods for image enhancement that take full advantage of available spectral information are proposed here to enhance spatial features of interest and denoising. Tensor Anisotropic Nonlinear Diffusion (TAND) is one way to deal with the problem of denoising a HSI. The idea is to use the structure tensor to find the direction where the edges that separate the regions in the scene are. The tensor information is then used either to prevent diffusion from happening across edges (Edge Enhancement Diffusion (EED)) or to enhance or complete the edges or other linear structures (Coherence Enhancement Diffusion (CED)). A new structure tensor is proposed here that gives higher weight to spectral bands with edge information. This variable weighting improves EED and CED enhancement. Results of this method are presented for HSI images of different spatial characteristics to demonstrate the potential capabilities of the method. The proposed image enhancement algorithm can be an integral part of a target detection system using HSI.

Approximating the k-means clustering objective with an online learning algorithm is an open problem. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to re-weight the experts, the algorithms compute an approximation to the current value of the k-means objective obtained by each expert.

When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k- means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Notably, our approximation bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. Our algorithm’s empirical performance tracks that of the best clustering algorithm in its expert set.

This is joint work with Anna Choromanska.

When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k- means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Notably, our approximation bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. Our algorithm’s empirical performance tracks that of the best clustering algorithm in its expert set.

This is joint work with Anna Choromanska.

Machine learning researchers focus on two distinct learning scenarios for structured graph and network data. In one scenario, the domain consists of a population of structured examples (e.g., chemical compounds) and we can reason about learning algorithms asymptotically, as the number of structured examples increases. In the other scenario, the domain consists of a single, potentially infinite-sized network (e.g., Facebook). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Even when there are a set of network samples available for learning, they correspond to subnetworks drawn from the same underlying network and thus may be dependent.

Although statistical relational learning have been successfully applied for social network classification tasks, the algorithms were initially developed based on an implicit assumption of an underlying population of networks---which does not hold for most social network datasets. In this talk, I will present our recent efforts to outline a more formal foundation for single network learning and discuss how the analysis has informed the development of more accurate estimation, inference, and evaluation methods.

Although statistical relational learning have been successfully applied for social network classification tasks, the algorithms were initially developed based on an implicit assumption of an underlying population of networks---which does not hold for most social network datasets. In this talk, I will present our recent efforts to outline a more formal foundation for single network learning and discuss how the analysis has informed the development of more accurate estimation, inference, and evaluation methods.

This talk will deal with the notions of adaptive and non-adaptive information, in the context of statistical learning and inference. Suppose that we have a collection of models (e.g., signals, systems, representations, etc.) denoted by X and a collection of measurement actions (e.g., samples, probes, queries, experiments, etc.) denoted by Y. A particular model x in X best describes the problem at hand and is measured as follows. Each measurement action, y in Y, generates an observation y(x) that is a function of the unknown model. This function may be deterministic or stochastic. The goal is to identify x from a set of measurements y_1(x),...,y_n(x), where y_i in Y, i=1,...,n. If the measurement actions y_1,...,y_n are chosen deterministically or randomly without knowledge of x, then the measurement process is non-adaptive. However, If y_i is selected in a way that depends on the previous measurements y_1(x),...,y_{i-1}(x), then the process is adaptive. Adaptive information is clearly more flexible, since the process can always disregard previously collected data. The advantage of adaptive information is that it can sequentially focus measurements or queries to distinguish the elements of X that are most consistent with previously collected data, and this can dramatically reduce the number of measurements required for reliable decisions. The idea of adaptive information gathering is commonplace (e.g., humans and animals excel at this), but outside of simple parametric settings fairly little is known about the fundamental limits and capabilities of such systems. The key question of interest here is identifying situations in which adaptive information is significantly more effective than non-adaptive information. The answer depends on the interrelationship between the model and measurement spaces X and Y. The talk will cover the general problem, and more deeply consider two illustrative examples from machine learning.

In this work, we explore methods for the inverse process to dimensionality reduction. That is, the recovery of a high-dimensional observation from its low-dimensional counterpart such that the inherent structure in the data is preserved. Processing high-dimensional data in the ambient space quickly becomes computationally expensive and often impossible. As a result, methods have been developed to reduce the dimensionality of the data to a level where the desired operations can be performed. In cases where the operation of interest produces an out-of-sample point, in the low-dimensional space, one does not have tools to reproduce the out-of-sample point in the ambient space. We propose a method for recovering a best-fit high-dimensional realization from its low-dimensional counterpart. We focus on synthetic data that exhibits a linear relationship between the high-dimensional observations and the low-dimensional latent variables. We have performed experiments on data drawn form statistical distributions and low-dimensional smooth manifolds. Our results suggest proof of concept and our methods are being extended to data that exhibits non-linear relationships.

Modern problems across science and engineering increasingly require high-dimensional modeling: models with more parameters than observations. It is now well understood that statistically reliable inference is still possible under such high-dimensional settings, provided one restricts to constrained subclasses of models with particular low-dimensional structure. Examples include linear regression with sparsity constraints (compressed sensing), estimation of covariance or inverse covariance matrices, sparse principal component analysis, low-rank matrix estimation, and sparse additive non-parametric models. Over the past decade, there has been a strong body of work that have proposed statistical estimators for infering such structurally constrained high-dimensional models, with strong statistical guarantees.

In this talk, we consider the computational facet of such estimation: could one provide a general computational scheme to solve any of the convex optimization problems that arise in such high-dimensional inference? We find that such a general computational scheme is indeed possible: we discuss a scheme based on a greedy strategy. Our framework not only unifies existing greedy algorithms that have proposed for such high-dimensional problems by recovering them as special cases but also yields novel ones.

Based on joint work with Inderjit Dhillon and Ambuj Tewari.

In this talk, we consider the computational facet of such estimation: could one provide a general computational scheme to solve any of the convex optimization problems that arise in such high-dimensional inference? We find that such a general computational scheme is indeed possible: we discuss a scheme based on a greedy strategy. Our framework not only unifies existing greedy algorithms that have proposed for such high-dimensional problems by recovering them as special cases but also yields novel ones.

Based on joint work with Inderjit Dhillon and Ambuj Tewari.

Stochastic Gradient Descent (SGD) is a popular optimization algorithm for solving data-driven machine learning problems such as classification, model selection, sequence labeling, and recommendation. SGD is well suited to processing large amounts of data due to its robustness against noise, rapid convergence rates, and predictable memory footprint. Nevertheless, SGD seems to be impeded by many classical barriers to scalability: (1) SGD appears to be inherently sequential, (2) SGD assumes uniform sampling from the underlying data set resulting in poor locality, and (3) current approaches to parallelize SGD require performance-destroying, fine-grained communication.

This talk aims to refute the conventional wisdom that SGD inherently suffers from these impediments. Specifically, I will show that SGD can be implemented in parallel with minimal communication, with no locking or synchronization, and with strong spatial locality. I will provide both theoretical and experimental evidence demonstrating the achievement of linear speedups on multicore workstations on several benchmark optimization problems. Finally, I will close with a discussion of a challenging problem raised by our implementations relating arithmetic and geometric means of positive definite matrices.

Joint work with Feng Niu, Christopher Re, and Stephen Wright.

This talk aims to refute the conventional wisdom that SGD inherently suffers from these impediments. Specifically, I will show that SGD can be implemented in parallel with minimal communication, with no locking or synchronization, and with strong spatial locality. I will provide both theoretical and experimental evidence demonstrating the achievement of linear speedups on multicore workstations on several benchmark optimization problems. Finally, I will close with a discussion of a challenging problem raised by our implementations relating arithmetic and geometric means of positive definite matrices.

Joint work with Feng Niu, Christopher Re, and Stephen Wright.

Although the network clustering literature has focused on undirected networks, many networks are directed. For example, communication networks contain asymmetric relationships, representing the flow of information from one person to another. This talk will (1) demonstrate that co-clustering, instead of clustering, is more natural for many directed graphs, (2) propose a spectral algorithm and a statistical model for co-clustering, (3) show some asymptotic results, and (4) present a preliminary analysis of a citation network from Arxiv. Of key interest is the discovery of bottleneck nodes that transmit information between clusters of papers. This is joint work with Professor Bin Yu at UC Berkeley.

I am working on the design of predictive models that are both accurate and interpretable by a human. These models are built from association rules such as "dyspepsia & epigastric pain -> heartburn." Association rules are commonly used in the data mining community for database exploration, but have not been heavily employed in machine learning or statistics for prediction problems. I will present three algorithms for "decision lists," where classification is based on a list of rules:

1) A very simple Bayesian rule-based algorithm, which is to order rules based on the "adjusted confidence." In this case, users can understand the whole algorithm as well as the reason for the prediction. (This algorithm has a foundation in statistical learning theory, though I will not focus on this during the talk.)

2) A Bayesian hierarchical model for sequentially predicting conditions of medical patients, using association rules. This is essentially a recommender system for medical conditions. The model allows us to borrow strength from similar patients when there are not enough data available about a given patient. This is a hierarchical version of the adjusted confidence algorithm from the first topic.

3) A mixed-inter optimization (MIO) approach for learning decision lists. This algorithm has high accuracy and interpretability - both owing to the use of MIO. In our experiments, this algorithm has interpretability on par with decision trees, and has accuracy on par with boosted decision trees and SVM's with Gaussian kernels. The algorithm has regularization both on the sizes of rules, and also on the total number of rules in the list, leading to small lists.

This is joint work with David Madigan, Tyler McCormick, Allison Chang, and Dimitris Bertsimas.

Publications/Drafts of these works are here: topic 1: http://web.mit.edu/rudin/www/RudinLeKoMaSSRN11.pdf topic 2: http://web.mit.edu/rudin/www/McCormickRuMa11OR38511.pdf topic 3: http://web.mit.edu/rudin/www/BertsimasChRuOR38611.pdf

1) A very simple Bayesian rule-based algorithm, which is to order rules based on the "adjusted confidence." In this case, users can understand the whole algorithm as well as the reason for the prediction. (This algorithm has a foundation in statistical learning theory, though I will not focus on this during the talk.)

2) A Bayesian hierarchical model for sequentially predicting conditions of medical patients, using association rules. This is essentially a recommender system for medical conditions. The model allows us to borrow strength from similar patients when there are not enough data available about a given patient. This is a hierarchical version of the adjusted confidence algorithm from the first topic.

3) A mixed-inter optimization (MIO) approach for learning decision lists. This algorithm has high accuracy and interpretability - both owing to the use of MIO. In our experiments, this algorithm has interpretability on par with decision trees, and has accuracy on par with boosted decision trees and SVM's with Gaussian kernels. The algorithm has regularization both on the sizes of rules, and also on the total number of rules in the list, leading to small lists.

This is joint work with David Madigan, Tyler McCormick, Allison Chang, and Dimitris Bertsimas.

Publications/Drafts of these works are here: topic 1: http://web.mit.edu/rudin/www/RudinLeKoMaSSRN11.pdf topic 2: http://web.mit.edu/rudin/www/McCormickRuMa11OR38511.pdf topic 3: http://web.mit.edu/rudin/www/BertsimasChRuOR38611.pdf

Read More...

Optimizing multivariate performance measure is an important task in Machine Learning. Joachims (2005) introduced a Support Vector Method whose underlying optimization problem is commonly solved by cutting plane methods (CPMs) such as SVM-Perf and BMRM. It can be shown that CPMs converge to an accurate solution in $O(1/lambda epsilon)$ iterations, where $lambda$ is the trade-off parameter between the regularizer and the loss function. Motivated by the impressive convergence rate of CPM on a number of practical problems, it was conjectured that these rates can be further improved. We disprove this conjecture in this paper by constructing counter examples. However, surprisingly, we further discover that these problems are not inherently hard, and we develop a novel smoothing strategy, which in conjunction with Nesterov’s accelerated gradient method, can find an accurate solution in $O^*( min {1/epsilon , 1/sqrt{lambda epsilon}})$ iterations. Computationally, our smoothing technique is also particularly advantageous for optimizing multivariate performance scores such as precision/recall break-even point and ROCArea; the cost per iteration remains the same as that of CPMs. Empirical evaluation on some of the largest publicly available datasets shows that our method converges significantly faster than CPMs without sacrificing generalization ability.

In multivariate analysis, rank minimization emerges when a low-rank
structure of matrices is desired in addition to a small estimation error.
Rank minimization is nonconvex and generally NP-hard. In this talk,
I will present some of our recent results on rank minimization.
Computationally, we derive a closed-form for a
global minimizer of a nonconvex least squares problem, as well as
develop efficient algorithms to compute a global solution as well as an entire
regularization solution path. Theoretically, we show that our method
reconstructs the oracle estimator exactly for noisy data.
Finally, the utility of the proposed method is demonstrated by
simulations and image reconstruction from noisy background.

This work is joint with Shuo Xiang, Yunzhang Zhu and Jieping Ye.

This work is joint with Shuo Xiang, Yunzhang Zhu and Jieping Ye.

Inverse reinforcement learning is now a well established method for creating controllers by imitation learning. It is less often used for predicting behavior, and only rarely for understanding and interpreting behavior. We seek to improve throughput of surveillance data processing by using IRL techniques to detect anomalous actions for the analyst, together with a plausible explanation for the behavior. For this purpose, real-world interpretations must be associated with the basis functions in the linear reward function approximation. We create a practical scheme where the basis is built up in user interactions triggered by false alarms, thus yielding a small and relevant set of basis functions.
Experiments revealed limited rationality as the most important barrier to this approach to anomaly detection. The traditional IRL approach to limited rationality is to assume the agent is making noisy decisions with perfect knowledge of the value function. We argue that in reality, the exact value function is (cannot) be known and that people are in fact good at choosing the value-function-maximizing action under believed cost, but have misperceptions of the cost. Thus, we present a formulation of the IRL problem which trades off the planning margin (maximal rationality) and accuracy of action cost perception.

Statistical inference for large scale factorization and latent
variable model problems is challenging. It requires the ability to
partition the state space, to synchronize copies, and to perform
distributed updates. Such problems arise in very large scale topic
models dealing with 500 million documents, and in graph factorization
problems with 200 million vertices.

This talk describes basic tools from systems research for distributing data and computation over hundreds of computers and how to synchronize updates efficiently. We argue in favor of asynchronous updates, both from a systems design and from an experimental point of view. In particular, we show how a distributed approximate Gibbs sampler can be implemented for time-dependent latent variable models and how the method of multipliers can be adapted for large scale graph factorization.

This talk describes basic tools from systems research for distributing data and computation over hundreds of computers and how to synchronize updates efficiently. We argue in favor of asynchronous updates, both from a systems design and from an experimental point of view. In particular, we show how a distributed approximate Gibbs sampler can be implemented for time-dependent latent variable models and how the method of multipliers can be adapted for large scale graph factorization.

I will discuss deep connections between Statistical Learning, Online
Learning and Optimization. I will show that there is a tight
correspondence between the sample size required for learning and the
number of local oracle accesses required for optimization, and the
same measures of "complexity" (e.g. the fat-shattering dimension or
Rademacher complexity) control both of them. Furthermore, I will show
how the Mirror Descent method, and in particular its stochastic/online
variant, is in a strong sense "universal" for online learning,
statistical learning and optimization. That is, for a general class
of convex learning/optimization problems, Mirror Descent can always
achieve a (nearly) optimal guarantee. In the context of statistical
learning, this also implies that for a broad generic class of convex
problems, learning can be done optimally (in the worst-case
agnostic-PAC sense) with a single pass over the data.

We develop approaches to generate dynamical reservoir weights of Echo State Networks with desired properties reducing the amount of randomness. It is possible to create weight matrices with a predefined singular value spectrum. The procedure guarantees stability (Echo State property). We prove the minimization of the impact of noise on the training process. The resulting reservoir types are strongly related to in the literature already known reservoirs. Our experiments show, that well chosen input weights can improve performance.

Joined work with Roger Labahn and Welf Wustlich.

Joined work with Roger Labahn and Welf Wustlich.

We propose a convex optimization based method for tensor
decomposition, and analyze its statistical performance.
Conventionally tensor decomposition has been formulated as
non-convex optimization problems, which hindered the
analysis of their performance. We show under some conditions
that both the performance of noisy tensor decomposition and
tensor completion can be predicted by the quantity we call the
normalized rank. Numerical experiments show that our theory
can precisely predict the scaling behavior in practice. The
current analysis naturally extends the analysis of convex low-rank
matrix estimation to tensors. We also discuss some limitations
of our theory, open issues, and possible extensions.

Automated unmixing consists of finding the number of endmembers, their spectral signatures, and their abundances from a hyperspectral image. Most unmixing techniques are pixel–based procedures that do not take advantage of spatial information provided by hyperspectral image. This poster presents a new approach for unmixing analysis of hyperspectral imagery using a multiscale representation method based on nonlinear diffusion for the joint estimation of the number of endmember and their spectral signatures. A set of endmember is identified using the multiscale representation, then this set of endmembers is organized into endmember classes using clustering techniques. Abundances can be estimated using one of several available methods. Experimental results are presented using an AVIRIS image from AP Hill and the False Leaf image collected with the SOC 700 hyperspectral camera.

December 31, 1969

Often, to use a reservoir model as the basis for investment decisions, we desire it to be calibrated to field measurements. This process of model calibration is known as history matching. History matching typically requires solving an ill-posed inverse problem. It is inherently non-unique and requires a thorough uncertainty assessment. Here, we present an approach that relies on partitioning the parameter space for detailed exploration to identify multiple plausible history-matched models. We use workflows that integrate several machine learning methods for exploring the individual partitions to assess the feasibility of obtaining history-matched models that can be used further for production forecast. We illustrate the utility of the approach by applying it to synthetic reservoir model case study.

The group lasso is an extension of the popular lasso regression method which allows to select predefined groups of features jointly in the context of regression or supervised classification. I will discuss two extensions of the group lasso, motivated by applications in genomic data analysis. First, I will present a new fast method for multiple change-point detection in multidimensional signals, which boils down to a group Lasso regression problem and allows to detect frequent breakpoint location in DNA copy number profiles with millions of probes. Second, I will discuss the latent group lasso, an extension of the group lasso when groups can overlap, which enjoys interesting consistency properties and can be helpful for structured feature selection in high-dimensional gene expression data analysis for cancer prognosis. (Joint work with Kevin Bleakley, Guillaume Obozinski and Laurent Jacob)

If a discrete probability distribution in a model being tested for
goodness-of-fit is not close to uniform, forming the Pearson chi-square
statistic often involves renormalizing summands to different scales in
order to uniformize the asymptotic distribution. This often leads to
serious trouble in practice -- even in the absence of round-off errors --
as the talk will illustrate via numerous examples. Fortunately with the
now widespread availability of
computers, avoiding all the trouble is simple and easy: without
renormalization, the actual values taken by goodness-of-fit statistics are
not humanly interpretable,
but black-box computer programs can rapidly calculate their precise
significance.

http://arxiv.org/abs/1108.4126

(joint work with Will Perkins and Mark Tygert)

http://arxiv.org/abs/1108.4126

(joint work with Will Perkins and Mark Tygert)

We extend the classical problem of predicting a sequence of
outcomes from a finite alphabet to the matrix domain. In
this extension, the alphabet of n outcomes is replaced by
the set of all dyads, i.e. outer products uu' where u is a
vector in R^n of unit length. Whereas in the classical case
the goal is to learn (i.e. sequentially predict as well as)
the best multinomial distribution, in the matrix case we
desire to learn the density matrix that best explains the
observed sequence of dyads. We show how popular online
algorithms for learning a multinomial distribution can be
extended to learn density matrices. Intuitively, learning
the n^2 parameters of a density matrix is much harder than
learning the n parameters of a multinomial distribution.
Completely surprisingly, we prove that the worst-case
regrets of certain classical algorithms and their matrix
generalizations are identical. The reason is that the
worst-case sequence of dyads share a common eigensystem,
i.e. the worst case regret is achieved in the classical
case. So these matrix algorithms learn the eigenvectors
without any regret.

Joint work with Wouter M. Koolen and Wojtek Kotlowski that appeared in NIPS 2011

Joint work with Wouter M. Koolen and Wojtek Kotlowski that appeared in NIPS 2011

March 27, 2012

Multiplicative updates multiply the parameters by
nonnegative factors. These updates are motivated by
a Maximum Entropy Principle and they are prevalent in evolutionary
processes where the parameters are for example
concentrations of species and the factors are survival rates.
The simplest such update is Bayes rule and we give
an in vitro selection algorithm for RNA strands that
implements this rule in the test tube where
each RNA strand represents a different model. In one liter of the RNA soup there are approximately 10^20 different strands
and therefore this is a rather high-dimensional implementation of Bayes rule.

We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The ``blessing'' of these updates is that they learn very fast because the good parameters grow exponentially. However their ``curse'' is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in Machine Learning, but nature also has some additional tricks.

This will be a high level talk. No background in online learning or evolutionary Biology will be required.

We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The ``blessing'' of these updates is that they learn very fast because the good parameters grow exponentially. However their ``curse'' is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in Machine Learning, but nature also has some additional tricks.

This will be a high level talk. No background in online learning or evolutionary Biology will be required.

Machine learning, computational statistics, and signal reconstruction
in compressed sensing have proved to be a rich source of interesting
and challenging optimization problems in recent years. In these and
other fields, the optimization formulations require the computed
solution to satisfy some structural requirements (such as sparsity or
low total variation), as well as the traditional requirements of
feasibility and low objective value. Although the optimization
problems arising from these areas are quite diverse, there are common
features such as large underlying data sets and high-dimensional
variable spaces that influence the choice of algorithmic tools most
suitable for solving them.

In this tutorial, we start by surveying the space of optimization problems that arise in machine learning and sparse optimization, highlighting the features of each application and the requirements they place on algorithms. We then sketch several fundamental tools from optimization that have proved useful in these applications. These include first-order methods and their accelerated variants, stochastic gradient and incremental gradient methods, augmented Lagrangian, shrinking / thresholding approaches, coordinate relaxation, and higher-order approaches. We also discuss the role of duality and the opportunities for parallel computing.

In this tutorial, we start by surveying the space of optimization problems that arise in machine learning and sparse optimization, highlighting the features of each application and the requirements they place on algorithms. We then sketch several fundamental tools from optimization that have proved useful in these applications. These include first-order methods and their accelerated variants, stochastic gradient and incremental gradient methods, augmented Lagrangian, shrinking / thresholding approaches, coordinate relaxation, and higher-order approaches. We also discuss the role of duality and the opportunities for parallel computing.

Influenza viruses have been responsible for large losses of lives around
the world and continue to present a great public health challenge.
Antigenic characterization based on hemagglutination inhibition (HI) assay
is one of the routine procedures for influenza vaccine strain selection.
However, HI assay is only a crude experiment reflecting the antigenic
correlations among testing antigens and reference antisera. Moreover,
antigenic characterization is usually based on more than one HI dataset.
The combination of multiple datasets results in an incomplete HI matrix
with many unobserved entries. We proposed a computational framework for
constructing an influenza antigenic cartography from incomplete HI matrix
by formulating this problem into a low rank matrix completion algorithm,
which is employed to fill in the entries of the HI matrix and also to
reduce the noises in HI experiments. A MDS algorithm is utilized to map
the antigens (or similarly, antibodies) into a two or three dimensional
space for visualization ([ http://sysbio.cvm.msstate.edu/AntigenMap3D
]sysbio.cvm.msstate.edu/ AntigenMap3D). Alternating Gradient Descent
(AGD) was employed as the matrix completion algorithm. Simulation and the
application of this method in H3N2 data demonstrated the effectiveness and
robustness of this method in influenza antigenic cartography construction.
More advanced low rank matrix completion algorithms are further developed
for this critical biological question.

Xiu-Feng Wan1, Zhi-Peng Cai1, Lamar Barnett1, Zhi-Xia Yang1, Jialiang Yang1, and Tong Zhang2 1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2Department of Statistics and Biostatistics, Rutgers University, Piscataway, NJ 08854, USA.

Xiu-Feng Wan1, Zhi-Peng Cai1, Lamar Barnett1, Zhi-Xia Yang1, Jialiang Yang1, and Tong Zhang2 1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2Department of Statistics and Biostatistics, Rutgers University, Piscataway, NJ 08854, USA.

We propose a computational framework, which integrates feature selection
methods like lasso and ridge selection and antigenic cartography, to
identify single or coupled mutations from genomic sequence causing
antigenic drift events for influenza viruses. This algorithm was applied
and experimentally proved to be successful in H3N2 and H5N1 Influenza A
virus datasets. By using the identified key mutation sites, we were able
to conduct sequence-based antigenic characterization and predict the
antigenic evolution of influenza viruses.

Xiu-Feng Wan1 , Jialiang Yang1, Hailiang Sun1, Zhipeng Cai1, Mariette F. Ducatez2, Richard J. Webby2, Li-Ping Long1 and Tong Zhang3

1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2 Department of Infectious Diseases, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105, USA; 3 Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Xiu-Feng Wan1 , Jialiang Yang1, Hailiang Sun1, Zhipeng Cai1, Mariette F. Ducatez2, Richard J. Webby2, Li-Ping Long1 and Tong Zhang3

1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2 Department of Infectious Diseases, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105, USA; 3 Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

December 31, 1969

We introduce a new collection of convex loss functions for estimating
precision matrices, including the likelihood function of Gaussian
graphical model as a special case. Another interesting special case
gives rise to a simple loss function called the D-Trace loss which is
expressed as the difference of two trace operators. We then introduce
a new sparse precision matrix estimator defined as the minimizer of
the ℓ1 penalized D-Trace loss under a positive definite constraint. We
develop a very efficient algorithm based on alternating direction
methods for computing the positive definite constrained ℓ1 penalized
D-Trace loss estimator. Under a new irrepresentable condition our
estimator has the sparse recovery property. We establish rates of
convergence of our estimator under the element-wise maximum norm,
Frobenius norm and operator norm. Simulated and real data are used to
demonstrate the computational efficiency of our algorithm and the
finite sample performance of our estimator. It is shown that our
estimator compares favorably with the ℓ1 penalized MLE.

This is a joint work with Hui Zou

This is a joint work with Hui Zou

For a given set of data points lying on a low-dimensional manifold
embedded in a high-dimensional space, the dimensionality reduction is to
recover a low-dimensional parametrization from the data set. The Hessian
Eigenmaps is a mathematically rigorous method that also sets a theoretical
framework for the nonlinear dimensionality reduction problem. We present a
discrete version of the Hessian Eigenmaps method and provide an analysis,
giving conditions under which the method works as intended. As an
application, a procedure to modify the standard constructions of k-nearest
neighborhoods is presented to ensure that Hessian LLE can recover the
original coordinates up to an affine transformation.

Joint work with Qiang Ye

Joint work with Qiang Ye