A Parallel Scan Algorithm in the Tensor Core Unit Model

Distributed, Parallel, and Cluster Computing 2024-11-28 v1 Data Structures and Algorithms

Abstract

We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size $s$ is a basic operation. In the $(s^2, \ell)$-TCU model, we show that for inputs of size $n$, the algorithm has depth at most $2\lfloor \log_s (n)\rfloor$ and runs in $O(n(1 + \ell /s^2)/p + (s^2 + \ell) \log_s (n))$ time assuming $p$ tensor core units. Equivalently, the algorithm performs $O(n/s^2)$ multiplications of square matrices of size $s$.
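To illustrate how a scan can be phrased in terms of $s \times s$ matrix products, here is a minimal sequential sketch in Python. It is not the algorithm from the paper. It relies on the standard observation that multiplying a tile by an upper-triangular all-ones matrix yields row-wise prefix sums, recursively scans the row totals, and then adds the resulting offsets with ordinary additions. The names `matmul`, `tcu_scan`, and `counter` are hypothetical, and the plain-Python `matmul` merely stands in for a single tensor core operation.

```python
from itertools import accumulate


def matmul(a, b):
    """Plain s x s matrix product; a stand-in for one tensor core operation."""
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def tcu_scan(xs, s, counter=None):
    """Inclusive prefix sum of xs built from s x s matrix products.

    counter is an optional one-element list (hypothetical helper) that
    accumulates the number of matrix products performed.
    """
    if s < 2:
        raise ValueError("s must be at least 2")
    if counter is None:
        counter = [0]
    xs = list(xs)
    n = len(xs)
    if n == 0:
        return []

    # upper[k][j] = 1 when k <= j, so (A * upper)[i][j] = A[i][0] + ... + A[i][j].
    upper = [[1 if k <= j else 0 for j in range(s)] for k in range(s)]

    if n <= s:
        # Base case: the input fits in one row of a zero-padded tile.
        a = [xs + [0] * (s - n)] + [[0] * s for _ in range(s - 1)]
        counter[0] += 1
        return matmul(a, upper)[0][:n]

    # Pad to a multiple of s*s so the input splits into full s x s tiles.
    tile = s * s
    padded = xs + [0] * (-n % tile)
    local = []  # prefix sums local to each length-s row
    for start in range(0, len(padded), tile):
        a = [padded[start + i * s: start + (i + 1) * s] for i in range(s)]
        counter[0] += 1
        for row in matmul(a, upper):
            local.extend(row)

    # The last entry of each row is that row's total; scan the totals recursively.
    num_rows = -(-n // s)  # ceil(n / s)
    totals = local[s - 1::s][:num_rows]
    scanned = tcu_scan(totals, s, counter)

    # Row r is shifted by the sum of all earlier rows (an exclusive scan of totals).
    offsets = [0] + scanned[:-1]
    return [local[i] + offsets[i // s] for i in range(n)]


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
assert tcu_scan(data, 2) == list(accumulate(data))
print(tcu_scan([1, 2, 3, 4, 5], 2))  # [1, 3, 6, 10, 15]
```

In this sketch the top level issues $\lceil n/s^2 \rceil$ matrix products and each recursive level shrinks the input by a factor of $s$, which matches the flavor of the $O(n/s^2)$ multiplication count stated in the abstract. The tiles at each level are independent, which is where parallelism across $p$ units would come from.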

Cite

@article{arxiv.2411.17887,
  title  = {A Parallel Scan Algorithm in the Tensor Core Unit Model},
  author = {Anastasios Zouzias and William F. McColl},
  journal= {arXiv preprint arXiv:2411.17887},
  year   = {2024}
}

Comments

14 pages, published in the 29th International European Conference on Parallel and Distributed Computing (EuroPar 2023)