Large Scale Subspace Clustering
Friday, September 16, 2016 - 10:00am - 10:50am
Subspace clustering refers to the problem of clustering data drawn from a union of low-dimensional subspaces of a high-dimensional space. State-of-the-art subspace clustering methods are based on expressing each data point as a linear combination of other data points while regularizing the matrix of coefficients with the L1, L2, or nuclear norms. While such methods have become very popular due to their simplicity, theoretical guarantees and empirical success, their applicability to large datasets (beyond 10K data points) has been limited due to time complexity and/or memory footprint issues. In this talk, I will present two scalable subspace clustering methods. The first one is based on orthogonal matching pursuit (OMP), a well-known and computationally efficient greedy algorithm for L0 minimization. Our results show that OMP is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad conditions (e.g., arbitrary subspaces and corrupted data). However, since the affinity produced by OMP/L1 is sparse, each cluster may not be well-connected. The second method addresses both the scalability and connectivity issues by using the elastic net (EN) regularizer, a mixture of the L1 and L2 norms. Our geometric analysis of EN explains why it enhances connectedness (due to L2 regularization) while maintaining the subspace-preserving property (due to L1 regularization). We also derive a provably correct and scalable active set method for solving the EN optimization problem. Experiments on synthetic data verify our theoretical analysis, and show that the proposed methods not only achieve state-of-the-art clustering performance, but also efficiently handle datasets with 500K data points.