Large Scale Distributed Linear Algebra With Tensor Processing Units

Computational Physics 2022-09-14 v1 Distributed, Parallel, and Cluster Computing Quantum Physics

Abstract

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically two-dimensional network topology, and high-bandwidth memory (HBM) permit distributed matrix multiplication algorithms to rapidly become computationally bound. In this regime, the matrix-multiply units (MXU)s dominate the runtime, yielding impressive scaling, performance, and raw size: operating in float32 precision, a full 2048-core pod of third generation TPUs can multiply two matrices with linear size $N = 2^{20} = 1\,048\,576$ in about 2 minutes. Via curated algorithms emphasizing large, single-core matrix multiplications, other tasks in dense linear algebra can similarly scale. As examples, we present (i) QR decomposition; (ii) resolution of linear systems; and (iii) the computation of matrix functions by polynomial iteration, demonstrated by the matrix polar factorization.
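
To illustrate what "matrix functions by polynomial iteration" can look like for the polar factorization, here is a minimal single-machine sketch using the Newton-Schulz iteration, a standard scheme built only from matrix multiplications. It is an assumption chosen for illustration and not necessarily the iteration the authors use. The helper names (`matmul`, `transpose`, `polar_newton_schulz`) are hypothetical, and the pure-Python list-of-rows representation only keeps the example dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def polar_newton_schulz(a, tol=1e-12, max_iter=200):
    """Approximate the polar factorization a = u h of a square, nonsingular matrix.

    u is (numerically) orthogonal and h is symmetric positive definite.
    Illustrative sketch only: Newton-Schulz is an assumed choice of
    polynomial iteration, not a method taken from the paper.
    """
    n = len(a)
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the iteration.
    fro = math.sqrt(sum(v * v for row in a for v in row))
    x = [[v / fro for v in row] for row in a]
    for _ in range(max_iter):
        xtx = matmul(transpose(x), x)
        # Polynomial update X <- X (1.5 I - 0.5 X^T X): two matmuls per step.
        poly = [[(1.5 if i == j else 0.0) - 0.5 * xtx[i][j] for j in range(n)]
                for i in range(n)]
        x_next = matmul(x, poly)
        step = max(abs(x_next[i][j] - x[i][j]) for i in range(n) for j in range(n))
        x = x_next
        if step < tol:
            break
    u = x
    # Recover the symmetric factor and symmetrize to remove rounding asymmetry.
    h = matmul(transpose(u), a)
    h = [[0.5 * (h[i][j] + h[j][i]) for j in range(n)] for i in range(n)]
    return u, h


if __name__ == "__main__":
    u, h = polar_newton_schulz([[4.0, 1.0], [2.0, 3.0]])
    print("u =", u)
    print("h =", h)
```

Each step costs two matrix multiplications and nothing else, which is the property that makes such iterations attractive on hardware dominated by matrix-multiply units. Convergence slows for ill-conditioned inputs, so a production scheme would likely need a more careful choice of polynomial.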

Cite

@article{arxiv.2112.09017,
  title   = {Large Scale Distributed Linear Algebra With Tensor Processing Units},
  author  = {Adam G. M. Lewis and Jackson Beall and Martin Ganahl and Markus Hauru and Shrestha Basu Mallick and Guifre Vidal},
  journal = {arXiv preprint arXiv:2112.09017},
  year    = {2022}
}

Comments

12 pages, 8 figures