Communication-avoiding Cholesky-QR2 for rectangular matrices

Distributed, Parallel, and Cluster Computing; Mathematical Software. 2019-06-18 (v6)

Abstract

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of $\Theta(P^{1/6})$ less interprocessor communication on $P$ processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
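For readers unfamiliar with the kernel named in the abstract, the following is a minimal sequential sketch of the textbook CholeskyQR2 scheme: form the Gram matrix $A^T A$, take its Cholesky factor $R_1$, set $Q_1 = A R_1^{-1}$, then repeat the same step on $Q_1$ to restore orthogonality. It is not the 3D-grid parallel algorithm of the paper; the helper names (`gram`, `cholesky_upper`, `solve_right_upper`, `cholesky_qr`, `cholesky_qr2`) are hypothetical, and the input is assumed to be a tall, full-rank, reasonably well-conditioned matrix stored as a list of rows.

```python
import math

def gram(a):
    """Return the n x n Gram matrix A^T A of an m x n matrix given as a list of rows."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]

def cholesky_upper(g):
    """Return upper-triangular R with R^T R = G, assuming G is symmetric positive definite."""
    n = len(g)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][i] = math.sqrt(g[i][i] - sum(r[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            r[i][j] = (g[i][j] - sum(r[k][i] * r[k][j] for k in range(i))) / r[i][i]
    return r

def solve_right_upper(a, r):
    """Return Q with Q R = A, using substitution along each row of A."""
    n = len(r)
    q = []
    for row in a:
        x = [0.0] * n
        for j in range(n):
            x[j] = (row[j] - sum(x[k] * r[k][j] for k in range(j))) / r[j][j]
        q.append(x)
    return q

def matmul(x, y):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def cholesky_qr(a):
    """One CholeskyQR pass: R from the Gram matrix, then Q = A R^{-1}."""
    r = cholesky_upper(gram(a))
    return solve_right_upper(a, r), r

def cholesky_qr2(a):
    """Two CholeskyQR passes: A = Q1 R1 and Q1 = Q R2, hence A = Q (R2 R1)."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, matmul(r2, r1)

# Small illustrative input (values chosen arbitrarily for this sketch).
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.5]]
Q, R = cholesky_qr2(A)
```

In this sketch the only operations touching the tall dimension are the Gram matrix product and the triangular solve, which is why the scheme is attractive as a starting point for communication-avoiding parallelizations.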

Cite

@article{arxiv.1710.08471,
  title  = {Communication-avoiding Cholesky-QR2 for rectangular matrices},
  author = {Edward Hutter and Edgar Solomonik},
  journal= {arXiv preprint arXiv:1710.08471},
  year   = {2019}
}