Robust Statistics: Challenges of High Dimensions

Tuesday, March 27, 2012 - 3:30pm - 4:15pm
Keller 3-180
Constantine Caramanis (The University of Texas at Austin)
In this talk we revisit (high dimensional) sparse regression -- the topic of much recent study and attention. Unlike the standard setting where covariates (the sensing matrix entries) are assumed to be known perfectly and the output (the response variables) free of persistent errors/corruption, real world applications may be plagued by both. Meanwhile, it is well known that while computationally efficient and statistically consistent in the standard setting, the presence of even a single corrupted point is enough to destroy the performance of regression algorithms like Lasso, Orthogonal Matching Pursuit, and others.

We consider support recovery in the case of arbitrary corruptions in both the response variables and the covariates, and ask how many such corruptions we can tolerate, compared to the total number of observations, while still allowing support recovery. We show that this is significantly more challenging than corruption only in the response variable. Indeed, to the best of our knowledge, there are no techniques that provide guarantees this setting; in particular, EM based algorithms, as well as Lasso, OMP and the recently analyzed Justice Pursuit, fail. We provide a robust regression algorithm that is simple, tractable, and can correct an increasing number of such arbitrarily corrupted points. Perhaps surprisingly, we show that the natural (and exponential time) brute force algorithm that minimizes regression error over all possible subsets of points and support, performs significantly worse than our efficient algorithm, in terms of support recovery.

En route, we discuss the challenges of robust statistics in high dimensions, the reason classical approaches fail in this regime, and, perhaps interestingly, suggest why robust regression is fundamentally more difficult than robust PCA.
MSC Code: