HOME    »    SCIENTIFIC RESOURCES    »    Volumes
Abstracts and Talk Materials
Machine Learning: Theory and Computation
March 26 - 30, 2012

Anima Anandkumar (University of California)

Learning Loopy Graphical Models with Latent Variables: Tractable Regimes and Efficient Methods
March 26, 2012

Keywords of the presentation: Graphical models, latent variables, graph estimation, quartet methods.

Capturing complex interactions among a large set of variables is a challenging task. Probabilistic graphical models decouple these interactions into two parts, viz., structural or qualitative relationships represented by a graph, and parametric or quantitative relationships represented by values assigned to different groups of nodes. Graph estimation is an important task since it reveals useful relationships between the variables, but is challenging in high dimensions for loopy graphical models. Another important aspect that needs to be incorporated in modeling is the presence of latent or hidden variables. In this talk, I will present models and methods for graph estimation in loopy graphical models with latent variables. I will give an overview of models and regimes, where graph estimation is tractable as well as present the strong theoretical guarantees proven for our proposed method. I will then present experimental outcome on newsgroup dataset, where it is seen that latent variables, corresponding to topics, are efficiently discovered, and incorporating loops in the model relaxes the requirement of a strict hierarchy between topics and words.

Juan Andres Bazerque (University of Minnesota, Twin Cities)

Poster - Nonparametric Low-Rank Tensor Imputation
December 31, 1969

Completion or imputation of three-way data arrays with missing entries is a basic problem encountered in various areas, including bioinformatics, image processing, and preference analysis. If available, prior information about the data at hand should be incorporated to enhance performance of the imputation method adopted. This is the motivation behind the proposed low-rank tensor estimator which leverages the correlation across slices of the data cube in the form of reproducing kernels. The rank of the tensor estimate is controlled by a novel regularization on the factors of its PARAFAC decomposition. Such a regularization is inspired by a reformulation of the nuclear norm for matrices, which allows to bypass the challenge that rank and singular values of tensors are unrelated quantities. The proposed technique is tested on MRI data of the brain with 30% missing data, resulting in a recovery error of −17dB.

Vincent Blondel (Université Catholique de Louvain)

How to partition a country? Identification of social communities in mobile phone communication networks
March 28, 2012

Keywords of the presentation: complex networks, community detection, mobile phone dataset, data mining

We describe a simple and efficient method - the « Louvain method » - for the detection of communities in networks. The method has sub-quadratic computational complexity and can be routinely used for networks with billions of nodes or links ; the method was chosen a couple of months ago by LinkedIn for its network visualization tool. As an illustration of the method, we describe the communities obtained on a social network constructed from billions of mobile phone communications at country scale. The low complexity of the method allows the analysis of large time-evolving networks and we illustrate this with a vizualisation of how a country-wide social network evolve over a six months period.

Constantine Caramanis (The University of Texas at Austin)

Robust Statistics: Challenges of High Dimensions
March 27, 2012

Keywords of the presentation: robustness, high dimensional statistics, regression, PCA

In this talk we revisit (high dimensional) sparse regression -- the topic of much recent study and attention. Unlike the standard setting where covariates (the sensing matrix entries) are assumed to be known perfectly and the output (the response variables) free of persistent errors/corruption, real world applications may be plagued by both. Meanwhile, it is well known that while computationally efficient and statistically consistent in the standard setting, the presence of even a single corrupted point is enough to destroy the performance of regression algorithms like Lasso, Orthogonal Matching Pursuit, and others.

We consider support recovery in the case of arbitrary corruptions in both the response variables and the covariates, and ask how many such corruptions we can tolerate, compared to the total number of observations, while still allowing support recovery. We show that this is significantly more challenging than corruption only in the response variable. Indeed, to the best of our knowledge, there are no techniques that provide guarantees this setting; in particular, EM based algorithms, as well as Lasso, OMP and the recently analyzed Justice Pursuit, fail. We provide a robust regression algorithm that is simple, tractable, and can correct an increasing number of such arbitrarily corrupted points. Perhaps surprisingly, we show that the natural (and exponential time) brute force algorithm that minimizes regression error over all possible subsets of points and support, performs significantly worse than our efficient algorithm, in terms of support recovery.

En route, we discuss the challenges of robust statistics in high dimensions, the reason classical approaches fail in this regime, and, perhaps interestingly, suggest why robust regression is fundamentally more difficult than robust PCA.

Kathleen Mary Carley (Carnegie-Mellon University)

Dynamic Network Analysis of the Arab Spring
March 29, 2012

Keywords of the presentation: social networks, network science, network-based text mining, dynamic network analysis

The Arab Spring was a period of social change through a series of protests and revolutions set against a backdrop of revolutions. During this time, the actors and issues were changing. This talk describes how dynamic network analysis, combining network based text mining with network analysis support the assessment of this change. News data on 18 countries for a period of 1 year are assessed and changes in the key actors, secondary actors, and issues identified. The structural analysis shows a stationarity in the incumbent power structure in contrast to volatility in the structure of the revolution.

Gunnar Carlsson (Stanford University)

Topological methods for large and complex data sets
March 26, 2012

Keywords of the presentation: topology, data analysis, homology, point clouds

Over the last ten years, a number of methodologies have been developed which leverage topological techniques and ways of thinking to provide understanding of point cloud data. These include ways of "measuring" shape via homological signatures, topological mapping techniques, and applications of certain kinds of diagram constructions to collections of samples. I will survey these methods, and propose some direct connections of these ideas with mainstream machine learning.

Venkat Chandrasekaran (University of California, Berkeley)

The Convex Geometry of Linear Inverse Problems
March 26, 2012

Deducing the state or structure of a system from partial, noisy measurements is a fundamental task throughout the sciences and engineering. The resulting inverse problems are often ill-posed because there are fewer measurements available than the ambient dimension of the model to be estimated. In practice, however, many interesting signals or models contain few degrees of freedom relative to their ambient dimension: a small number of genes may constitute the signature of a disease, very few parameters may specify the correlation structure of a time series, or a sparse collection of geometric constraints may determine a molecular configuration. Discovering, leveraging, or recognizing such low-dimensional structure plays an important role in making inverse problems well-posed. Examples of structured models include previously studied cases such as sparse signals and low-rank matrices, as well as others such as low-rank tensors, binary vectors, orthogonal matrices, and matrices formed as the sum of a few permutations. Inspired by the success of the L1-norm and nuclear-norm heuristics, we propose a general convex relaxation framework to recover such simple structured models from partial information. We provide sharp estimates of the number of generic measurements required for exact and robust recovery in a variety of settings. These estimates are based on computing certain Gaussian statistics related to the underlying model geometry. Thus our work extends the catalog of objects and structures (beyond sparse vectors and low-rank matrices) that can be recovered from limited information via tractable convex programming.

Joint work with Benjamin Recht, Pablo Parrilo, and Alan Willsky.

Xi Chen (Carnegie-Mellon University)

Poster - Uniformly-Optimal Stochastic Gradient Methods with Double Projections
December 31, 1969

We present a series of stochastic gradient methods for solving convex regularized stochastic optimization problems. In contrast to most existing algorithms with only one proximal mapping, our approaches adopt two proximal mappings at each iteration and the final solution is directly obtained from the second proximal mapping. With this double projection technique, we can show a uniformly-optimal expected convergence rate for the final solution rather than the averaged iterates. Therefore, when applied to sparse learning problems, our algorithms have the advantage of naturally generating sparse solutions. Furthermore, we establish an improved convergence rate for solving strongly convex problems. We also establish the variance for the error of the objective value and high probability bounds.

Corinna Cortes (Google Inc.)

Tutorial - Learning Kernels
March 29, 2012

Keywords of the presentation: large-scale learning

Kevin Jamieson (University of Wisconsin, Madison)

Poster - Active Ranking using Pairwise Comparisons
December 31, 1969

This work examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of $n$ objects can be identified by standard sorting methods using $nlog_2 n$ pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. {Specifically, we assume that the objects can be embedded into a $d$-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in $R^d$. We show that under this assumption the number of possible rankings grows like $n^{2d}$ and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than $dlog n$ adaptively selected pairwise comparisons, on average.} If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis.

Arta Jamshidi (Princeton University)

Poster - Recent Developments on High Dimensional Function Approximation and Various Important Applications
December 31, 1969

The discovery of knowledge in large data sets can often be formulated as a problem in nonlinear function approximation. The inherent challenge in such an approach is that the data is often high dimensional, scattered, not stationary and sparse. Ideally, a good model will also be able to adapt and remain valid over extended regions in space and time. Our state of the art multivariate black-box algorithm that automatically fits multi-scale nonlinear functions over high dimensional domains which does not require ad-hoc parameters is presented. The performance on prediction of financial time series and hurricane intensity is demonstrated. skew radial basis functions to achieve more accuracy and lower model orders in empirical modeling when there are sharp transitions in the data is presented. The novel time varying RBFs and techniques for parsimonious modeling of spatiotemporal dynamics over high dimensional domains are presented, in the context of equation-free frame work. The fundamental role of non-linear function approximation techniques for value/policy function approximation, in the context of approximate dynamic programming is highlighted. We consider stochastic energy resource allocations as our application for this study.

Jinzhu Jia (Beijing (Peking) University)

Poster - Feature and Compnent Selection in Multivariate Sparse Anova Models Using Boosting Method
December 31, 1969

Statistical methods which deal with feature selection and model fitting simultaneously (such as the LASSO) have been studied extensively for linear models. However in many applications, the relationship between predictors and the response are nonlinear and there may be interaction effects between predictors.

We present sparse functional ANOVA models for modeling the relationship between predictors and the response. A sparse ANOVA model decomposes the predictive function into main effects and interactions. Our goal is to select a few predictors and select a few main effects and/or interactions. Feature selection, function component selection and parameter estimation are performed simultaneously with a boosting procedure. We compare our Sparse ANOVA method with the MARS and the COSSO in simulations and real examples. The Sparse ANOVA gives very competitive performances in these studies, especially for high-dimensional data.

Satyen Kale (IBM)

Poster - Efficient Optimal Learning for Contextual Bandits
December 31, 1969

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time polylog(N), where N is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.

Satyen Kale (IBM)

Efficient Optimal Learning for Contextual Bandits
March 30, 2012

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost sensitive classification learner as an oracle and has a running time polylog(N), where N is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.

Joint work with Miroslav Dudik, Daniel Hsu, Nikos Karampatziakis, John Langford, Lev Reyzin and Tong Zhang.

Jure Leskovec (Stanford University)

Tutorial - Part 1: Models of Information Flow and Social Media
March 28, 2012

Keywords of the presentation: social networks, community detection, graph analysis

Online social media represent a fundamental shift of how information is being produced, transferred and consumed. User generated content in the form of blog posts, comments, and tweets establishes a connection between the producers and the consumers of information. Tracking the pulse of the social media outlets, enables companies to gain feedback and insight in how to improve and market products better. For consumers, the abundance of information and opinions from diverse sources helps them tap into the wisdom of crowds, to aid in making more informed decisions.

The talk investigates machine learning techniques for modeling social networks and social media. First part will discuss methods for extracting and tracking information as it spreads among users. We will examine methods for extracting temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity and attention given to particular piece of content, and how to build predictive models of information diffusion and adoption. Second part will focus on models for extracting structure from social networks and predicting emergence of new links in social networks. In particular, we will examine methods based on Supervised Random Walks for learning to rank nodes on a graph and consequently recommend new friendships in social networks. We will also consider the problem of detecting dense overlapping clusters in networks and present efficient model based methods for network community detection.

Lek-Heng Lim (University of Chicago)

Principal Components of Cumulants
March 26, 2012

Multivariate Gaussian data are completely characterized by their mean and covariance but higher-order cumulants are unavoidable in non-Gaussian data. For univariate data, these are well-studied via skewness and kurtosis but for multivariate data, these cumulants are tensor-valued --- higher-order analogs of the covariance matrix capturing higher-order dependence in the data. We argue that multivariate cumulants may be studied via their principal components, defined in a manner analogous to the usual principal components of a covariance matrix. It is best viewed as a subspace selection method that accounts for higher-order dependence the way PCA obtains varimax subspaces. A variant of stochastic gradient descent on the Grassmannian permits us to estimate principal components of cumulants of any order in excess of 10,000 dimensions readily on a laptop computer. This is joint work with Jason Morton.

Maider J. Marin-McGee (University of Puerto Rico)

Poster - Tensor Anisotropic Nonlinear Diffusion for Hyperspectral Image Enhancement
December 31, 1969

Using Hyperspectral Imaging (HSI) with its capacity of sampling the electromagnetic spectrum using hundreds of narrow and contiguous spectral bands is a valuable remote sensing technology for many defense and security applications. Noise introduced by the Hyperspectral sensor and the interfering medium, in addition to the natural spectral variability of the materials, make information extraction more difficult. Pre-processing methods for image enhancement that take full advantage of available spectral information are proposed here to enhance spatial features of interest and denoising. Tensor Anisotropic Nonlinear Diffusion (TAND) is one way to deal with the problem of denoising a HSI. The idea is to use the structure tensor to find the direction where the edges that separate the regions in the scene are. The tensor information is then used either to prevent diffusion from happening across edges (Edge Enhancement Diffusion (EED)) or to enhance or complete the edges or other linear structures (Coherence Enhancement Diffusion (CED)). A new structure tensor is proposed here that gives higher weight to spectral bands with edge information. This variable weighting improves EED and CED enhancement. Results of this method are presented for HSI images of different spatial characteristics to demonstrate the potential capabilities of the method. The proposed image enhancement algorithm can be an integral part of a target detection system using HSI.

Claire Monteleoni (George Washington University)

Poster - Online Clustering with Experts
December 31, 1969

Approximating the k-means clustering objective with an online learning algorithm is an open problem. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to re-weight the experts, the algorithms compute an approximation to the current value of the k-means objective obtained by each expert.

When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k- means objective. The form of these online clustering approximation guarantees is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Notably, our approximation bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. Our algorithm’s empirical performance tracks that of the best clustering algorithm in its expert set.

This is joint work with Anna Choromanska.

Jennifer Neville (Purdue University)

How to learn from a single network: Relational learning for social network domains
March 28, 2012

Keywords of the presentation: Relational learning, social networks

Machine learning researchers focus on two distinct learning scenarios for structured graph and network data. In one scenario, the domain consists of a population of structured examples (e.g., chemical compounds) and we can reason about learning algorithms asymptotically, as the number of structured examples increases. In the other scenario, the domain consists of a single, potentially infinite-sized network (e.g., Facebook). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Even when there are a set of network samples available for learning, they correspond to subnetworks drawn from the same underlying network and thus may be dependent.

Although statistical relational learning have been successfully applied for social network classification tasks, the algorithms were initially developed based on an implicit assumption of an underlying population of networks---which does not hold for most social network datasets. In this talk, I will present our recent efforts to outline a more formal foundation for single network learning and discuss how the analysis has informed the development of more accurate estimation, inference, and evaluation methods.

Robert Nowak (University of Wisconsin, Madison)

Learning on a Budget
March 28, 2012

This talk will deal with the notions of adaptive and non-adaptive information, in the context of statistical learning and inference. Suppose that we have a collection of models (e.g., signals, systems, representations, etc.) denoted by X and a collection of measurement actions (e.g., samples, probes, queries, experiments, etc.) denoted by Y. A particular model x in X best describes the problem at hand and is measured as follows. Each measurement action, y in Y, generates an observation y(x) that is a function of the unknown model. This function may be deterministic or stochastic. The goal is to identify x from a set of measurements y_1(x),...,y_n(x), where y_i in Y, i=1,...,n. If the measurement actions y_1,...,y_n are chosen deterministically or randomly without knowledge of x, then the measurement process is non-adaptive. However, If y_i is selected in a way that depends on the previous measurements y_1(x),...,y_{i-1}(x), then the process is adaptive. Adaptive information is clearly more flexible, since the process can always disregard previously collected data. The advantage of adaptive information is that it can sequentially focus measurements or queries to distinguish the elements of X that are most consistent with previously collected data, and this can dramatically reduce the number of measurements required for reliable decisions. The idea of adaptive information gathering is commonplace (e.g., humans and animals excel at this), but outside of simple parametric settings fairly little is known about the fundamental limits and capabilities of such systems. The key question of interest here is identifying situations in which adaptive information is significantly more effective than non-adaptive information. The answer depends on the interrelationship between the model and measurement spaces X and Y. The talk will cover the general problem, and more deeply consider two illustrative examples from machine learning.

Juan Ramirez Jr. (University of Colorado)

Poster - Towards Inverse Methods for Dimensionality Reduction
December 31, 1969

In this work, we explore methods for the inverse process to dimensionality reduction. That is, the recovery of a high-dimensional observation from its low-dimensional counterpart such that the inherent structure in the data is preserved. Processing high-dimensional data in the ambient space quickly becomes computationally expensive and often impossible. As a result, methods have been developed to reduce the dimensionality of the data to a level where the desired operations can be performed. In cases where the operation of interest produces an out-of-sample point, in the low-dimensional space, one does not have tools to reproduce the out-of-sample point in the ambient space. We propose a method for recovering a best-fit high-dimensional realization from its low-dimensional counterpart. We focus on synthetic data that exhibits a linear relationship between the high-dimensional observations and the low-dimensional latent variables. We have performed experiments on data drawn form statistical distributions and low-dimensional smooth manifolds. Our results suggest proof of concept and our methods are being extended to data that exhibits non-linear relationships.

Pradeep Ravikumar (The University of Texas at Austin)

Greedy Algorithms for Structurally Constrained High Dimensional Problems
March 27, 2012

Keywords of the presentation: High-dimensional Statistics, Greedy Algorithms, Optimization

Modern problems across science and engineering increasingly require high-dimensional modeling: models with more parameters than observations. It is now well understood that statistically reliable inference is still possible under such high-dimensional settings, provided one restricts to constrained subclasses of models with particular low-dimensional structure. Examples include linear regression with sparsity constraints (compressed sensing), estimation of covariance or inverse covariance matrices, sparse principal component analysis, low-rank matrix estimation, and sparse additive non-parametric models. Over the past decade, there has been a strong body of work that have proposed statistical estimators for infering such structurally constrained high-dimensional models, with strong statistical guarantees.

In this talk, we consider the computational facet of such estimation: could one provide a general computational scheme to solve any of the convex optimization problems that arise in such high-dimensional inference? We find that such a general computational scheme is indeed possible: we discuss a scheme based on a greedy strategy. Our framework not only unifies existing greedy algorithms that have proposed for such high-dimensional problems by recovering them as special cases but also yields novel ones.

Based on joint work with Inderjit Dhillon and Ambuj Tewari.

Ben Recht (University of Wisconsin, Madison)

Hogwild for Machine Learning on Multicore
March 28, 2012

Stochastic Gradient Descent (SGD) is a popular optimization algorithm for solving data-driven machine learning problems such as classification, model selection, sequence labeling, and recommendation. SGD is well suited to processing large amounts of data due to its robustness against noise, rapid convergence rates, and predictable memory footprint. Nevertheless, SGD seems to be impeded by many classical barriers to scalability: (1) SGD appears to be inherently sequential, (2) SGD assumes uniform sampling from the underlying data set resulting in poor locality, and (3) current approaches to parallelize SGD require performance-destroying, fine-grained communication.

This talk aims to refute the conventional wisdom that SGD inherently suffers from these impediments. Specifically, I will show that SGD can be implemented in parallel with minimal communication, with no locking or synchronization, and with strong spatial locality. I will provide both theoretical and experimental evidence demonstrating the achievement of linear speedups on multicore workstations on several benchmark optimization problems. Finally, I will close with a discussion of a challenging problem raised by our implementations relating arithmetic and geometric means of positive definite matrices.

Joint work with Feng Niu, Christopher Re, and Stephen Wright.

Karl Rohe (University of Wisconsin, Madison)

Co-clustering for directed graphs: an algorithm, a model, and some asymptotics
March 29, 2012

Keywords of the presentation: spectral clustering

Although the network clustering literature has focused on undirected networks, many networks are directed. For example, communication networks contain asymmetric relationships, representing the flow of information from one person to another. This talk will (1) demonstrate that co-clustering, instead of clustering, is more natural for many directed graphs, (2) propose a spectral algorithm and a statistical model for co-clustering, (3) show some asymptotic results, and (4) present a preliminary analysis of a citation network from Arxiv. Of key interest is the discovery of bottleneck nodes that transmit information between clusters of papers. This is joint work with Professor Bin Yu at UC Berkeley.

Cynthia Rudin (Massachusetts Institute of Technology)

Modeling with Rules
March 29, 2012

Keywords of the presentation: association rules, hierarchical Bayes, mixed-integer optimization

I am working on the design of predictive models that are both accurate and interpretable by a human. These models are built from association rules such as "dyspepsia & epigastric pain -> heartburn." Association rules are commonly used in the data mining community for database exploration, but have not been heavily employed in machine learning or statistics for prediction problems. I will present three algorithms for "decision lists," where classification is based on a list of rules:

1) A very simple Bayesian rule-based algorithm, which is to order rules based on the "adjusted confidence." In this case, users can understand the whole algorithm as well as the reason for the prediction. (This algorithm has a foundation in statistical learning theory, though I will not focus on this during the talk.)

2) A Bayesian hierarchical model for sequentially predicting conditions of medical patients, using association rules. This is essentially a recommender system for medical conditions. The model allows us to borrow strength from similar patients when there are not enough data available about a given patient. This is a hierarchical version of the adjusted confidence algorithm from the first topic.

3) A mixed-inter optimization (MIO) approach for learning decision lists. This algorithm has high accuracy and interpretability - both owing to the use of MIO. In our experiments, this algorithm has interpretability on par with decision trees, and has accuracy on par with boosted decision trees and SVM's with Gaussian kernels. The algorithm has regularization both on the sizes of rules, and also on the total number of rules in the list, leading to small lists.

This is joint work with David Madigan, Tyler McCormick, Allison Chang, and Dimitris Bertsimas.

Publications/Drafts of these works are here: topic 1: http://web.mit.edu/rudin/www/RudinLeKoMaSSRN11.pdf topic 2: http://web.mit.edu/rudin/www/McCormickRuMa11OR38511.pdf topic 3: http://web.mit.edu/rudin/www/BertsimasChRuOR38611.pdf


Ankan Saha (University of Chicago)

Poster - Smoothing Multivariate Performance Measures
December 31, 1969

Optimizing multivariate performance measure is an important task in Machine Learning. Joachims (2005) introduced a Support Vector Method whose underlying optimization problem is commonly solved by cutting plane methods (CPMs) such as SVM-Perf and BMRM. It can be shown that CPMs converge to an accurate solution in $O(1/lambda epsilon)$ iterations, where $lambda$ is the trade-off parameter between the regularizer and the loss function. Motivated by the impressive convergence rate of CPM on a number of practical problems, it was conjectured that these rates can be further improved. We disprove this conjecture in this paper by constructing counter examples. However, surprisingly, we further discover that these problems are not inherently hard, and we develop a novel smoothing strategy, which in conjunction with Nesterov’s accelerated gradient method, can find an accurate solution in $O^*( min {1/epsilon , 1/sqrt{lambda epsilon}})$ iterations. Computationally, our smoothing technique is also particularly advantageous for optimizing multivariate performance scores such as precision/recall break-even point and ROCArea; the cost per iteration remains the same as that of CPMs. Empirical evaluation on some of the largest publicly available datasets shows that our method converges significantly faster than CPMs without sacrificing generalization ability.

Xioatong Shen (University of Minnesota, Twin Cities)

On least squares rank minimization
March 30, 2012

In multivariate analysis, rank minimization emerges when a low-rank structure of matrices is desired in addition to a small estimation error. Rank minimization is nonconvex and generally NP-hard. In this talk, I will present some of our recent results on rank minimization. Computationally, we derive a closed-form for a global minimizer of a nonconvex least squares problem, as well as develop efficient algorithms to compute a global solution as well as an entire regularization solution path. Theoretically, we show that our method reconstructs the oracle estimator exactly for noisy data. Finally, the utility of the proposed method is demonstrated by simulations and image reconstruction from noisy background.

This work is joint with Shuo Xiang, Yunzhang Zhu and Jieping Ye.

Tomas Singliar (The Boeing Company)

Poster - Limited Rationality by Limited Perception
December 31, 1969

Inverse reinforcement learning is now a well established method for creating controllers by imitation learning. It is less often used for predicting behavior, and only rarely for understanding and interpreting behavior. We seek to improve throughput of surveillance data processing by using IRL techniques to detect anomalous actions for the analyst, together with a plausible explanation for the behavior. For this purpose, real-world interpretations must be associated with the basis functions in the linear reward function approximation. We create a practical scheme where the basis is built up in user interactions triggered by false alarms, thus yielding a small and relevant set of basis functions. Experiments revealed limited rationality as the most important barrier to this approach to anomaly detection. The traditional IRL approach to limited rationality is to assume the agent is making noisy decisions with perfect knowledge of the value function. We argue that in reality, the exact value function is (cannot) be known and that people are in fact good at choosing the value-function-maximizing action under believed cost, but have misperceptions of the cost. Thus, we present a formulation of the IRL problem which trades off the planning margin (maximal rationality) and accuracy of action cost perception.

Alexander Johannes Smola (Yahoo! Inc.)

Tutorial - Large Scale Inference
March 26, 2012

Statistical inference for large scale factorization and latent variable model problems is challenging. It requires the ability to partition the state space, to synchronize copies, and to perform distributed updates. Such problems arise in very large scale topic models dealing with 500 million documents, and in graph factorization problems with 200 million vertices.

This talk describes basic tools from systems research for distributing data and computation over hundreds of computers and how to synchronize updates efficiently. We argue in favor of asynchronous updates, both from a systems design and from an experimental point of view. In particular, we show how a distributed approximate Gibbs sampler can be implemented for time-dependent latent variable models and how the method of multipliers can be adapted for large scale graph factorization.

Alexander Johannes Smola (Yahoo! Inc.)

Tutorial - Large Scale Inference
March 26, 2012

Nathan (Nati) Srebro (Toyota Technological Institute at Chicago)

Optimization, Learning and the Universality of Mirror Descent
March 29, 2012

Keywords of the presentation: Statistical Learning Theory, Stochastic Optimization, Online Learning

I will discuss deep connections between Statistical Learning, Online Learning and Optimization. I will show that there is a tight correspondence between the sample size required for learning and the number of local oracle accesses required for optimization, and the same measures of "complexity" (e.g. the fat-shattering dimension or Rademacher complexity) control both of them. Furthermore, I will show how the Mirror Descent method, and in particular its stochastic/online variant, is in a strong sense "universal" for online learning, statistical learning and optimization. That is, for a general class of convex learning/optimization problems, Mirror Descent can always achieve a (nearly) optimal guarantee. In the context of statistical learning, this also implies that for a broad generic class of convex problems, learning can be done optimally (in the worst-case agnostic-PAC sense) with a single pass over the data.

Tobias Strauß (Universität Rostock)

Poster - Design strategies for weight matrices of Echo State Networks
December 31, 1969

We develop approaches to generate dynamical reservoir weights of Echo State Networks with desired properties reducing the amount of randomness. It is possible to create weight matrices with a predefined singular value spectrum. The procedure guarantees stability (Echo State property). We prove the minimization of the impact of noise on the training process. The resulting reservoir types are strongly related to in the literature already known reservoirs. Our experiments show, that well chosen input weights can improve performance.

Joined work with Roger Labahn and Welf Wustlich.

Ryota Tomioka (University of Tokyo)

Poster - Convex Tensor Decomposition with Performance Guarantee
December 31, 1969

We propose a convex optimization based method for tensor decomposition, and analyze its statistical performance. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that both the performance of noisy tensor decomposition and tensor completion can be predicted by the quantity we call the normalized rank. Numerical experiments show that our theory can precisely predict the scaling behavior in practice. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. We also discuss some limitations of our theory, open issues, and possible extensions.

Maria C Torres-Madronero (University of Puerto Rico)

Poster - Unsupervised unmixing analysis based on multiscale representation
December 31, 1969

Automated unmixing consists of finding the number of endmembers, their spectral signatures, and their abundances from a hyperspectral image. Most unmixing techniques are pixel–based procedures that do not take advantage of spatial information provided by hyperspectral image. This poster presents a new approach for unmixing analysis of hyperspectral imagery using a multiscale representation method based on nonlinear diffusion for the joint estimation of the number of endmember and their spectral signatures. A set of endmember is identified using the multiscale representation, then this set of endmembers is organized into endmember classes using clustering techniques. Abundances can be estimated using one of several available methods. Experimental results are presented using an AVIRIS image from AP Hill and the False Leaf image collected with the SOC 700 hyperspectral camera.

Ila Tripathi

Poster - Model Calibration & Uncertainty Assessment - A Machine Learning Perspective
December 31, 1969

Often, to use a reservoir model as the basis for investment decisions, we desire it to be calibrated to field measurements. This process of model calibration is known as history matching. History matching typically requires solving an ill-posed inverse problem. It is inherently non-unique and requires a thorough uncertainty assessment. Here, we present an approach that relies on partitioning the parameter space for detailed exploration to identify multiple plausible history-matched models. We use workflows that integrate several machine learning methods for exploring the individual partitions to assess the feasibility of obtaining history-matched models that can be used further for production forecast. We illustrate the utility of the approach by applying it to synthetic reservoir model case study.

Jean-Philippe Vert (École Nationale Supérieure des Mines de Paris)

Group lasso for genomic data
March 28, 2012

Keywords of the presentation: group lasso, feature selection, bioinformatics

The group lasso is an extension of the popular lasso regression method which allows to select predefined groups of features jointly in the context of regression or supervised classification. I will discuss two extensions of the group lasso, motivated by applications in genomic data analysis. First, I will present a new fast method for multiple change-point detection in multidimensional signals, which boils down to a group Lasso regression problem and allows to detect frequent breakpoint location in DNA copy number profiles with millions of probes. Second, I will discuss the latent group lasso, an extension of the group lasso when groups can overlap, which enjoys interesting consistency properties and can be helpful for structured feature selection in high-dimensional gene expression data analysis for cancer prognosis. (Joint work with Kevin Bleakley, Guillaume Obozinski and Laurent Jacob)

Rachel Ward (The University of Texas at Austin)

Robust computer-enabled tests of goodness-of-fit
March 29, 2012

Keywords of the presentation: goodness-of-fit, Pearson's chi-square, Monte-Carlo

If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, forming the Pearson chi-square statistic often involves renormalizing summands to different scales in order to uniformize the asymptotic distribution. This often leads to serious trouble in practice -- even in the absence of round-off errors -- as the talk will illustrate via numerous examples. Fortunately with the now widespread availability of computers, avoiding all the trouble is simple and easy: without renormalization, the actual values taken by goodness-of-fit statistics are not humanly interpretable, but black-box computer programs can rapidly calculate their precise significance.


(joint work with Will Perkins and Mark Tygert)

Manfred K Warmuth (University of California)

Poster - Learning Eigenvectors for Free
December 31, 1969

We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of n outcomes is replaced by the set of all dyads, i.e. outer products uu' where u is a vector in R^n of unit length. Whereas in the classical case the goal is to learn (i.e. sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the n^2 parameters of a density matrix is much harder than learning the n parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequence of dyads share a common eigensystem, i.e. the worst case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

Joint work with Wouter M. Koolen and Wojtek Kotlowski that appeared in NIPS 2011

Manfred K Warmuth (University of California)

The blessing and the curse of the multiplicative updates - discusses connections between in vitro selection, and the multiplicative updates of online learning
March 27, 2012

Multiplicative updates multiply the parameters by nonnegative factors. These updates are motivated by a Maximum Entropy Principle and they are prevalent in evolutionary processes where the parameters are for example concentrations of species and the factors are survival rates. The simplest such update is Bayes rule and we give an in vitro selection algorithm for RNA strands that implements this rule in the test tube where each RNA strand represents a different model. In one liter of the RNA soup there are approximately 10^20 different strands and therefore this is a rather high-dimensional implementation of Bayes rule.

We investigate multiplicative updates for the purpose of learning online while processing a stream of examples. The ``blessing'' of these updates is that they learn very fast because the good parameters grow exponentially. However their ``curse'' is that they learn too fast and wipe out parameters too quickly. We describe a number of methods developed in the realm of online learning that ameliorate the curse of these updates. The methods make the algorithm robust against data that changes over time and prevent the currently good parameters from taking over. We also discuss how the curse is circumvented by nature. Some of nature's methods parallel the ones developed in Machine Learning, but nature also has some additional tricks.

This will be a high level talk. No background in online learning or evolutionary Biology will be required.

Stephen Wright (University of Wisconsin, Madison)

Tutorial - Algorithms for Sparse Optimization and Machine Learning
March 27, 2012

Machine learning, computational statistics, and signal reconstruction in compressed sensing have proved to be a rich source of interesting and challenging optimization problems in recent years. In these and other fields, the optimization formulations require the computed solution to satisfy some structural requirements (such as sparsity or low total variation), as well as the traditional requirements of feasibility and low objective value. Although the optimization problems arising from these areas are quite diverse, there are common features such as large underlying data sets and high-dimensional variable spaces that influence the choice of algorithmic tools most suitable for solving them.

In this tutorial, we start by surveying the space of optimization problems that arise in machine learning and sparse optimization, highlighting the features of each application and the requirements they place on algorithms. We then sketch several fundamental tools from optimization that have proved useful in these applications. These include first-order methods and their accelerated variants, stochastic gradient and incremental gradient methods, augmented Lagrangian, shrinking / thresholding approaches, coordinate relaxation, and higher-order approaches. We also discuss the role of duality and the opportunities for parallel computing.

Stephen Wright (University of Wisconsin, Madison)

Tutorial - Algorithms for Sparse Optimization and Machine Learning
March 27, 2012

Zhixia Yang (Mississippi State University)

Poster - Influenza Antigenic Cartography using Low-Rank Matrix Completion
December 31, 1969

Influenza viruses have been responsible for large losses of lives around the world and continue to present a great public health challenge. Antigenic characterization based on hemagglutination inhibition (HI) assay is one of the routine procedures for influenza vaccine strain selection. However, HI assay is only a crude experiment reflecting the antigenic correlations among testing antigens and reference antisera. Moreover, antigenic characterization is usually based on more than one HI dataset. The combination of multiple datasets results in an incomplete HI matrix with many unobserved entries. We proposed a computational framework for constructing an influenza antigenic cartography from incomplete HI matrix by formulating this problem into a low rank matrix completion algorithm, which is employed to fill in the entries of the HI matrix and also to reduce the noises in HI experiments. A MDS algorithm is utilized to map the antigens (or similarly, antibodies) into a two or three dimensional space for visualization ([ http://sysbio.cvm.msstate.edu/AntigenMap3D ]sysbio.cvm.msstate.edu/ AntigenMap3D). Alternating Gradient Descent (AGD) was employed as the matrix completion algorithm. Simulation and the application of this method in H3N2 data demonstrated the effectiveness and robustness of this method in influenza antigenic cartography construction. More advanced low rank matrix completion algorithms are further developed for this critical biological question.

Xiu-Feng Wan1, Zhi-Peng Cai1, Lamar Barnett1, Zhi-Xia Yang1, Jialiang Yang1, and Tong Zhang2 1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2Department of Statistics and Biostatistics, Rutgers University, Piscataway, NJ 08854, USA.

Jialiang Yang (Mississippi State University)

Poster - Identification of influenza antigenic variants using sequence data
December 31, 1969

We propose a computational framework, which integrates feature selection methods like lasso and ridge selection and antigenic cartography, to identify single or coupled mutations from genomic sequence causing antigenic drift events for influenza viruses. This algorithm was applied and experimentally proved to be successful in H3N2 and H5N1 Influenza A virus datasets. By using the identified key mutation sites, we were able to conduct sequence-based antigenic characterization and predict the antigenic evolution of influenza viruses.

Xiu-Feng Wan1 , Jialiang Yang1, Hailiang Sun1, Zhipeng Cai1, Mariette F. Ducatez2, Richard J. Webby2, Li-Ping Long1 and Tong Zhang3

1Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA; 2 Department of Infectious Diseases, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105, USA; 3 Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

Teng Zhang (Princeton University)

Poster - Sparse Precision Matrix Estimation via Positive Denite Constrained Minimization of L1 Penalized D-Trace Loss
December 31, 1969

We introduce a new collection of convex loss functions for estimating precision matrices, including the likelihood function of Gaussian graphical model as a special case. Another interesting special case gives rise to a simple loss function called the D-Trace loss which is expressed as the difference of two trace operators. We then introduce a new sparse precision matrix estimator defined as the minimizer of the ℓ1 penalized D-Trace loss under a positive definite constraint. We develop a very efficient algorithm based on alternating direction methods for computing the positive definite constrained ℓ1 penalized D-Trace loss estimator. Under a new irrepresentable condition our estimator has the sparse recovery property. We establish rates of convergence of our estimator under the element-wise maximum norm, Frobenius norm and operator norm. Simulated and real data are used to demonstrate the computational efficiency of our algorithm and the finite sample performance of our estimator. It is shown that our estimator compares favorably with the ℓ1 penalized MLE.

This is a joint work with Hui Zou

Weifeng Zhi (University of Kentucky)

Poster - Discrete Hessian Eigenmaps Method for Dimensionality Reduction
December 31, 1969

For a given set of data points lying on a low-dimensional manifold embedded in a high-dimensional space, the dimensionality reduction is to recover a low-dimensional parametrization from the data set. The Hessian Eigenmaps is a mathematically rigorous method that also sets a theoretical framework for the nonlinear dimensionality reduction problem. We present a discrete version of the Hessian Eigenmaps method and provide an analysis, giving conditions under which the method works as intended. As an application, a procedure to modify the standard constructions of k-nearest neighborhoods is presented to ensure that Hessian LLE can recover the original coordinates up to an affine transformation.

Joint work with Qiang Ye

Connect With Us: