Communication Efficient Algorithms for Distributed Machine Learning

Monday, May 16, 2016 - 11:10am - 12:00pm
Keller 3-180
Nina Balcan (Carnegie-Mellon University)
We consider the problem of learning from distributed data and analyze fundamental algorithmic and communication complexity questions involved. Broadly, we consider a framework where information is distributed between several locations, and our goal is to learn a low-error hypothesis with respect to the overall data by using as little communication, and as few rounds of communication, as possible. As an example, suppose k research groups around the world have collected large scientific datasets, such as genomic sequence data or sky survey data, and we wish to perform learning over the union of all these different datasets without too much communication.

In this talk, I will first discuss a general statistical or PAC style framework for analyzing communication complexity issues involved when doing distributed supervised machine learning, i.e., learning from annotated data distributed across multiple locations. I will also discuss algorithms with good communication complexity for unsupervised learning and dimensionality reduction problems, with interesting connections to efficient distributed coreset construction.
MSC Code: