Improved Distributed Principal Component Analysis

Machine Learning 2014-12-24 v5

Abstract

We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve $\ell_2$-error fitting problems such as $k$-means clustering and subspace clustering. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.
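To make the setting concrete, the following is a minimal sketch of a simple baseline for distributed PCA, not the algorithm of this paper: each server compresses its points with a truncated SVD and sends only that summary to a coordinator, which stacks the summaries and runs PCA on them. The function names (`local_summary`, `distributed_pca`, `projection_cost`) and the truncation parameter `t` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def local_summary(points, t):
    """Server side: compress an (n_i x d) point set to a (t x d) summary.

    Keeps the top-t singular values and right singular vectors, so the
    server communicates t*d numbers instead of n_i*d.
    """
    _, s, vt = np.linalg.svd(points, full_matrices=False)
    return s[:t, None] * vt[:t]


def distributed_pca(local_datasets, k, t):
    """Coordinator side: stack the summaries and take the top-k directions.

    Returns a (d x k) matrix with orthonormal columns spanning the
    approximate principal subspace of the union of all point sets.
    """
    stacked = np.vstack([local_summary(p, t) for p in local_datasets])
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    return vt[:k].T


def projection_cost(points, basis):
    """Sum of squared l2 distances from the points to the subspace."""
    residual = points - points @ basis @ basis.T
    return float(np.sum(residual ** 2))
```

In this sketch, a larger `t` means more communication and a subspace closer to the one obtained by running PCA on the pooled data. When `t` equals the dimension `d`, the stacked summaries have the same Gram matrix as the pooled data, so the projection cost matches that of centralized PCA up to numerical error.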

Cite

@article{arxiv.1408.5823,
  title  = {Improved Distributed Principal Component Analysis},
  author = {Maria-Florina Balcan and Vandana Kanchanapally and Yingyu Liang and David Woodruff},
  journal= {arXiv preprint arXiv:1408.5823},
  year   = {2014}
}