September 26 - 30, 2011
We propose new convex
relaxations for the maximum clique, maximum edge biclique, and clique
partitioning problems. These relaxations are based on the observation
that there is a one-to-one correspondence between cliques or bicliques
in the input graph and rank-one blocks in its adjacency matrix. These
problems may be formulated as rank minimization or rank-constrained
optimization problems, which are subsequently relaxed to convex programs
using the nuclear norm. In the case that the input graph contains an
instance of the desired structure plus diversionary edges and nodes,our
relaxation is {bf exact}: solving the relaxation recovers the optimal
solution for the original hard combinatorial problem. For each problem,
we provide theoretical bounds on the size of the planted subgraph and
amount of noise that our algorithm can tolerate and still recover the
exact solution. Our heuristic is applicable to any input graph, and we
provide numerical examples that empirically demonstrate the
effectiveness of our algorithm for graphs drawn from clustering and
image segmentation applications.
Keywords of the presentation: Convex optimization, submodular functions
Sparse methods for supervised learning aim at ﬁnding good
linear predictors from as few variables as possible, i.e., with small
cardinality of their supports. This combinatorial selection problem is
often turned into a convex optimization problem by replacing the
cardinality function by its convex envelope (tightest convex lower bound),
in this case the L1-norm. In this work, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge
or structural constraints which are common in many applications:
namely, we show that for nondecreasing submodular set-functions, the
corresponding convex envelope can be obtained from its Lovasz extension,
a common tool in submodular analysis. This deﬁnes a family of polyhedral
norms, for which we provide generic algorithmic tools (subgradients
and proximal operators) and theoretical results (conditions for support
recovery or high-dimensional inference). By selecting speciﬁc submodular
functions, we can give a new interpretation to known norms, such as those
based on rank-statistics or grouped norms with potentially overlapping
groups; we also deﬁne new norms, in particular ones that can be used
as non-factorial priors for supervised learning.
Low-dimensional linear subspace approximations to high-dimensional data are a common approach to handling problems of estimation, detection and prediction, with applications such as network monitoring, collaborative filtering, object tracking in computer vision, and environmental sensing. Corrupt and missing data are the norm in many high-dimensional situations, not only because of errors and failures, but because it may be impossible to collect all the interesting measurements or impractical to compute in real-time using all the data available. Recently developed algorithms for "Matrix Completion" and "Robust PCA" have offered tools to find low-dimensional approximations to data even when data are missing and corrupt. However, these algorithms operate on all the available data at once and assume that the underlying low-dimensional subspace remains constant. This poster describes two alternatives, a subspace tracking algorithm called GROUSE (Grassmannian Rank-one Update Subspace Estimation) and its robust counterpart GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm). Both are incremental algorithms that process one incomplete and/or corrupted measurement vector at a time. Thus GROUSE and GRASTA operate with considerably less computation than the other algorithms, while at the same time allowing flexibility for real-time applications such as tracking and anomaly detection. I will present these algorithms and show their application to subspace tracking, matrix completion, robust PCA, and background and foreground separation in video surveillance.
We develop an information-theoretic perspective on some questions in high-dimensional convex geometry. First, we describe a new equipartition property for log-concave probability measures, which is a precise formulation of the intuition that log-concave distributions are effectively uniformly distributed on compact sets. Second, we develop some Gaussian comparison results for log-concave measures, including some entropic formulations of the hyperplane conjecture. Third, we establish a new reverse entropy power inequality for convex measures analogous to V. Milman's reverse Brunn-Minkowski inequality.
Keywords of the presentation: spectral algorithms, CCA, graphical models, latent trees
This talk considers the problem of learning the structure of a broad
class of multivariate latent variable tree models, which include a
variety of continuous and discrete models (including the widely used
linear-Gaussian models, hidden Markov models, and Markov
evolutionary trees). The setting is one where we only have samples
from certain observed variables in the tree and our goal is to
estimate the tree structure (emph{i.e.}, the graph of how the underlying
hidden variables are connected to the observed variables). We
provide the Spectral Recursive Grouping algorithm, an efficient and
simple bottom-up procedure for recovering the tree structure from
independent samples of the observed variables. Our finite sample
size bounds for exact recovery of the tree structure elucidate
certain natural dependencies on underlying statistical and
structural properties of the underlying joint
distribution. Furthermore, our sample complexity guarantees have no
explicit dependence on the dimensionality of the observed variables,
making the algorithm applicable to many high-dimensional settings.
At the heart of our algorithm is a spectral quartet test for
determining the relative topology of a quartet of variables, which
only utilizes certain second order statistics and is based
on the determinants of certain cross-covariance matrices.
This talk is based on joint work with A. Anandkumar, D. Hsu, S. Kakade, L. Song and T. Zhang.
Modeling high dimensional data by multiple low-dimensional planes
is an important problem in many applications such
as computer vision and pattern recognition. In
the most general setting where only coordinates of
the data are given, the problem asks to determine
the optimal model parameters, estimate the model
planes, and cluster the data accordingly. Though
many algorithms have been proposed, most of them
need to assume prior knowledge of the model parameters
and thus address only the last two components of the problem.
In this paper we propose an accurate and efficient algorithm
based on multiscale SVD analysis and spectral methods to tackle
the problem.
Keywords of the presentation: Harmonic Analysis , dimensional reduction , machine learning
We describe a mathematical framework for dimensional reduction, learning and organization of databases . The database could be a matrix of a linear transformation for which the goal is to reorganize the matrix so as to achieve compression and fast algorithms. Or the database could be a collection of documents and their vocabulary, an array of sensor measurements such as EEG , or financial a time series or segments of recorded music. If we view the database as a questionnaire , we organize the population into a contextual demographic diffusion geometry and the questions into a conceptual geometry, this is an iterative process in which each organization informs the other, with the goal of entropy reduction of the whole data base.
This organization being totally data agnostic applies to the other examples thereby generating automatically a data driven conceptual /contextual pairing.We will describe the basic underlying tools from Harmonic Analysis for measuring success in extracting structure , tools which enable functional regression prediction and basically signal processing methodologies.
Keywords of the presentation: Semidefinite programming.
This tutorial will briefly cover recent developments in semidefinite programming and some of their geometrical applications.
The Delsarte-Goethals frame has been proposed for deterministic compressive sensing of sparse and compressible signals. Its performance in compressive imaging applications falls short of that obtained for arbitrary sparse vectors. Prior work has proposed specially tailored signal recovery algorithms that partition the recovery of the input vector into clustered and unclustered portions. We present a formal analysis of the Delsarte-Goethals frame that shows that its hampered performance in compressive imaging is due to the presence of clustered sparse vectors in its null space. Such a clustered structure is characteristic in commonly-used raster scanning of 2-D wavelet representations. We also show that a simple randomized raster scanning of the wavelet coefficient vector can remedy these shortcomings, allowing the use of standard recovery algorithms. Additionally, we propose an alternative deterministic raster scanning that achieves similar recovery performance. Experimental results confirm the improvements in recovery performance for both the noiseless and noisy measurement regimes.
This is joint work with Sina Jafarpour and Robert Calderbank.
We consider a deviation from the mean in the problem
begin{equation}
D(x)=sup_{fin F}mu{f-E_{mu} fgeq x},qquad text{for}
xinRlabel{abstr}
end{equation} for the class $F$ of all integrable, $fin L_1(V,d,mu)$,
Lipschitz functions on a probability metric space $(V,d,mu)$. Consider the
following closely related isoperimetric problem. Given $t>0$,
begin{equation}
text{minimize} mu(A^h) text{over all} Asubset V text{with} mu(A)geq
t, label{absext}
end{equation} where $A^h={uin V: d(u,A)leq h}$ is an $h$-neighborhood of
$A$. We solve the problem $eqref{abstr}$ for spaces $(V,d,mu)$ such that for
every $t geq 0$ there exists a solution of $eqref{absext}$ which
does not depend
on $h$.
As corollaries we get exact solutions of $eqref{abstr}$ for Euclidean unit
sphere $S^{n-1}$ with a geodesic distance and a normalized Haar measure, for
$R^n$ equipped with a Gaussian measure and for $n$-dimensional cube, rectangle,
powers of tori, Diamond graph equipped with uniform measure and Hamming
distance. We also obtain a general result for probability metric spaces. Let
$F^{ast}$ be a family of distance functions $f(u)=-d(A,u)$, with $Asubset V$
measurable. For all probability metric spaces we prove that
$$
sup_{finF}mu{f-E_{mu} fgeq x}=sup_{finF^{ast}}mu{f-E_{mu}
fgeq x},qquad xinR.
$$
In this paper, we study a simple correlation-based strategy for estimating the unknown delay and amplitude of a signal based on a small number of noisy, randomly chosen frequency-domain samples. We model the output of this "compressive matched filter" as a random process whose mean equals the scaled, shifted autocorrelation function of the template signal. Using tools from the theory of empirical processes, we prove that the expected maximum deviation of this process from its mean decreases sharply as the number of measurements increases, and we also derive a probabilistic tail bound on the maximum deviation. Putting all of this together, we bound the minimum number of measurements required to guarantee that the empirical maximum of this random process occurs sufficiently close to the true peak of its mean function. We conclude that for broad classes of signals, this compressive matched filter will successfully estimate the unknown delay (with high probability, and within a prescribed tolerance) using a number of random frequency-domain samples that scales inversely with the signal-to-noise ratio and only logarithmically in the in the observation bandwidth and the possible range of delays.
The theory of Distilled Sensing (DS) establishes that sequential adaptive sensing provides significant quantifiable improvements in effective signal-to-noise ratio (relative to non-adaptive sensing) in noisy high dimensional testing problems. The DS idea was recently extended to highly undersampled settings typical in compressed sensing (CS), and it was shown that a Compressive Distilled Sensing (CDS) method can produce similar performance gains in noisy CS problems. However, the results obtained in the initial work on CDS were restricted to settings where the unknown sparse signals of interest have limited dynamic range (on the order of a constant). In this work we describe a new adaptive compressive sensing approach which utilizes sequentially designed sparse measurement matrices. Our analysis of this approach shows that the quantifiable benefits of adaptivity in noisy CS problems can be achieved without restrictive conditions on the dynamic range of the signal being acquired.
This is joint work with Rich Baraniuk (Rice University), Rui Castro (Eindhoven University of Technology), and Rob Nowak (University of Wisconsin.
Entropy estimation is a common approach to empirically determining the spread of non-spherical distributions and has been applied to areas such as anomaly detection, intrinsic dimension estimation, information fusion, and structure estimation in factor graphs. k-NN graph estimators are a popular choice for estimating entropy with desirable properties including consistency and ease of implementation. However, in the high dimensional setting, the bias of k-NN graph estimators dominates the mean square error (MSE) and decays very slowly in the sample size T at a rate of order T^(-1/d) where d is the dimension. We present a weighted k-NN estimator that improves the MSE convergence rate to T^(-1) independent of dimension. We illustrate the weighted estimator for an anomaly detection problem in wireless sensor networks.
Keywords of the presentation: dependency graphs, sparse correlation models, hubs of high vertex degree, false discovery error control, local Bhattacharrya divergence
Random matrices are measured in many areas of engineering, social science, and natural science. When the rows of the matrix are random samples of a vector of dependent variables the sample correlations between the columns of the matrix specify a correlation graph that can be used to explore the dependency structure. However, as the number of variables increases screening for highly correlated variables becomes futile due to a high dimensional phase transition phenomenon: with high probability most variables will have large sample correlations even when there is no correlation present. We will present a general theory for predicting the phase transition and provide Poisson limit theorems that can be used to predict finite sample behavior of the correlation graph. We apply our framework to the problem of hub discovery in correlation and partial correlation (concentration) graphs. We illustrate our correlation screening framework in computational biology and, in particular, for discovery of influential variables in gene regulation networks. This is joint work with Bala Rajaratnam at Stanford University.
Keywords of the presentation: sparse recovery, compressive sensing, adaptivity
The goal of sparse recovery is to recover a sparse (with few non-zeros) approximation of data from the available information. We consider the problem of recovering the (approximately) best k-sparse approximation x* of an n-dimensional vector x from linear measurements of x. It is known that this task can be accomplished using m=O(k log (n/k)) *non-adaptive* measurements and that this bound is tight.
In this talk we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably (sometimes exponentially) reduced, compared to the non-adaptive case. We also discuss implications of our results to data stream computing and compressive sensing.
Read More...
We present an efficient method for computing an
extendable re-parameterization of high dimensional stochastic data sets
generated by nonlinear mixing of independent physical parameters. Our method relies on the spectral decomposition of a particular diffusion kernel
constructed from observed data. According to Sturm-Liouville oscillation
theory, a subset of the kernel eigenvectors are monotonic functions of
the original parameters. Thus, a re-parameterization via these
eigenvectors provides an inverse map back to the space of physically
meaningful parameters. We further show how the inverse map can be
efficiently extended to newly observed data, and how the extension
enhances the performance of data classification.
We demonstrate the use of our method for solving empirically inverse problems such as the
reconstruction of geological formations from electro-magnetic measurements, and source localization
in Acoustics.
The least absolute shrinkage and selection operator (Lasso) for linear regression exploits the geometric interplay of the l2-data error objective and the l1-norm constraint to arbitrarily select sparse models. Guiding this uninformed selection process with sparsity models has been precisely the center of attention over the last decade in order to improve learning performance. To this end, we alter the selection process of Lasso in this paper to explicitly leverage combinatorial sparsity models (CSMs) via the combinatorial selection and least absolute shrinkage (CLASH) algorithm. A highlight is the introduction of a new algorithmic definition of CSMs, which we dub as the Polynomial time Modular ε-Approximation Property (PMAPε). PMAPε enables us to determine the impact of approximate combinatorial projections within CLASH. We then provide concrete guidelines how to leverage sets with PMAPε within CLASH, and characterize CLASH's estimation guarantees as a function of ε as well as the set restricted isometry constants of the regression matrix. Finally, we present experimental results using both simulated and real world data to demonstrate the effectiveness of CLASH.
Keywords of the presentation: Small Value Probability, Metric Entropy, Rare Events, Gaussian Measures
Small value probabilities or small deviations study the decay probability that positive random variables behave near zero.
In particular, small ball probabilities provide the asymptotic behavior of the probability measure inside a ball as the radius of the
ball tends to zero.
Metric entropy is defined as the logarithmic of the minimum covering number of compact set by balls of very small radius.
In this talk, we will provide an overview on precise connections between the small value probability
of Gaussian process/measure and the metric entropy of the associated compact operator.
Interplays and applications to many related problems/areas will be given, along with various fundamental tools and techniques from high dimensional probability theory.
Throughout, we use Brownian motion (integral operator) and Brownian
sheets (tensored Brownian motion and operator) as illustrating examples.
The theory of Compressive Sensing (CS) has enabled the efficient acquisition of high-bandwidth (but sparse) signals via nonuniform low-rate sampling protocols. While most work in CS
has focused on reconstructing the high-bandwidth signals from nonuniform low-rate samples, in this work, we consider the task of inferring the modulation of a communications signal directly in
the compressed domain, without requiring signal reconstruction. We show that the Nth power nonlinear features used for Automatic Modulation Recognition (AMR) are compressible in
the Fourier domain, and hence, that AMR of M-ary Phase-Shift-Keying (MPSK) modulated signals is possible by applying the same nonlinear transformation on nonuniform compressive
samples. We provide analytical support for the accurate approximation of AMR features from nonuniform samples, present practical rules for classification of modulation type using these
samples, and validate our proposed rules on simulated data.
We show how alternating direction methods are used recently in
sparse and low-rank optimization problems. We also show the iteration
complexity bounds for two specific types of alternating direction
methods, namely alternating linearizaton and multiple splitting methods.
We prove that the basic versions of both methods require $O(1/eps)$
iterations to obtain an eps-optimal solution, and the accelerated
versions of these methods require $O(1/sqrt{eps})$ iterations, while
the computational effort needed at each iteration is almost unchanged.
Numerical results on problems with millions of variables and constraints
arising from computer vision, machine learning and image processing are
reported to demonstrate the efficiency of the proposed methods.
We discuss several geometric approaches to the study of data sets in high-dimensional spaces that are assumed to have low-intrinsci dimension. On the one hand, we discuss diffusion geometry type of approaches, based on constructing proximity graphs between the data points in high dimensions, and using diffusion processes on such graphs to construct coordinates for the data and perform learning tasks. On the other hand, we discuss novel approaches based on multiscale geometric analysis, based on studying the behavior of local covariance matrices of the data at different scales. We apply this latter approach to intrinsic dimension estimation, multiscale dictionary learning, and density estimation.
We develop a novel geometric multiresolution analysis for analyzing intrinsically low dimensional point clouds in high-dimensional spaces, modeled as samples from a d-dimensional set M (in particular, a manifold) embedded in D-dimensional Euclidean space, in the regime d<<D. This type of situation has been recognized as important in various applications, such as the analysis of sounds, images, and gene arrays. In this paper we construct data-dependent multiscale dictionaries that aim at efficient encoding and manipulating of the data. Unlike existing constructions, our construction is fast, and so are the algorithms that map data points to dictionary coefficients and vice versa. In addition, data points have a guaranteed sparsity in terms of the dictionary.
We consider supervised learning problems where the features are embedded in a directed graph, and whose solutions are known to be a priori sparse. We address the task of selecting variables corresponding to connected components of the graph, and leverage to that effect recent work on structured sparsity. We introduce convex and non-convex regularization functions achieving this goal, which involve an exponential number of groups of variables, and show that when the graph is acyclic the combinatorial problems related to these formulations can be solved with network flow optimization techniques. We present experiments on image and genomic data, demonstrating the benefits of these penalties over existing ones as well as the scalability of our approach for prediction tasks.
This is a joint work with Bin Yu (UC Berkeley).
The information content of many phenomena of practical interest is often much less than what is suggested by their actual size. As an inspiring example, one active research area in biology is to understand the relations between the genes. While the number of genes in a so-called gene network can be large, the number of contributing genes to each given gene in the network is usually small compared to the size of the network. In other words, the behavior of each gene can be expressed as a sparse combination of other genes. The purpose of this work is to develop new theory and algorithms for exploiting this type of simplicity in the analysis of high-dimensional dynamical systems with a particular focus on system identification and estimation.
We look at high-dimensional but sparse systems, time varying systems with few piecewise-constant parameter changes, and large-scale interconnected dynamical systems when the graph has a sparse flow. Inspired by the field of Compressive Sensing (CS), we call these problems Compressive System Identification (CSI), Compressive Topology Identification (CTI), etc.
Keywords of the presentation: multiple testing, sparse recovery, sequential testing
Clustering and ranking is often based on pairwise similarities (metric data) or comparisons (ordinal data). Most methods assume that the entire collection of all possible pairwise similarities or comparisons are known, but in high-dimensional settings there may be missing data and/or the costs of collecting this information may be prohibitive. The existence of clusters or other intrinsic low-dimensional structure can induce redundancy in these data, and therefore it may be possible to robustly cluster or rank based on a small subset of the data. I will discuss two results that demonstrate this phenomenom. First, I will describe a clustering procedure that generates a hierarchical clustering of N objects using only N log N similarities, instead of all N(N-1)/2 similarities. The method can recover all clusters of size larger than log N, even in the presence of a limited fraction of arbitrarily corrupted or noisy similarities. Second, if N objects are embedded in d-dimensions (d
Keywords of the presentation: low-rank matrix approximation, subset selection
I will discuss recent algorithmic developments for the classical problem of approximating a given matrix by a low-rank matrix. This is motivated by the need of faster algorithms for very large data and certain applications that want the approximating matrix to have rows living in the span of only a few rows of the original matrix, which adds a combinatorial twist to the problem. The novel algorithms are based on sampling rows randomly (but non-uniformly) and random projection, from which a low rank approximation can be computed.
Keywords of the presentation: Eigenvalues, sample covariance matrix, population matrix
I will begin by reviewing limiting properties of the eigenvalues of a class of sample covariance matrices, where the vector dimension and the sample size
approach infinity, their ratio approaching a positive constant. These properties are relevant in situations in multivariate analysis where the vector dimension is large, but the number of samples needed to adequately approximate the population matrix (as prescribed in standard statistical procedures) cannot be attained. Work has been done in estimating the population eigenvalues from those of the sample covariance matrix. I will introduce a method devised by X. Mestre, and will present an extension of his method to another ensemble of random matrices important in wireless communications.
Despite the empirical success of spectral clustering, its performance under noise and in high dimensions is not well understood. We provide a precise characterization of the misclustering rate of spectral clustering under noisy similarities. Using minimax theory, we establish information theoretic lower bounds on the amount of noise any clustering algorithm can tolerate and demonstrate that spectral clustering is near-optimal. The gap relative to the optimal performance is the statistical price for computational efficiency gained by solving a relaxation of the minimum ratio-cut problem. Additionally, in high dimensional settings, the clusters can be very small and are often characterized by only a few relevant features. To enable simultaneous extraction of such a cluster and the relevant features, we consider a sparsity penalized spectral biclustering method and compare its performance to minimax limits.
Almost any kind of experimental sciences requires the analysis of the data. Depending on the specific application, the collection
of data may consist of measurements, signals, or images. In mathematical framework, all of those objects are functions. One
way to analyze them is by representing into a wavelet decomposition. Wavelets provide reconstruction (approximation) of the original
function (the collection of data). In order to characterize the approximation class, one has to establish Bernstein type
inequalitiy which helps to investigate the magnitude of the coefficients in a wavelet decomposition of the
function. The coefficients tell in what way the analyzing function needs to be modified in order to reconstruct the data.
We analyze wavelet coefficients for both the family of Daubechies orthonormal wavelets and the
family of semiorthogonal spline wavelets in L_p space. In particular, Bernstein type inequalities associated with wavelets
are presented. We obtain a sharp estimate of Bernstein type inequality for splines. We also study the asymptotic
behavior of wavelet coefficients for both families mentioned above. Comparison of these two families is done by using a quantity C_{k,p}
related to the wavelet generators. In addition we determine the exact rate of decay of the m-th order B-spline wavelet and thereby describe
the asymptotic behavior of the wavelet. Finally we will use splines in order to get the asymptotically best constant in Ball's integral inequality.