We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication of two square matrices of constant size $s$ is a basic operation. In the $(s^2, \ell)$-TCU model, we show that for inputs of size $n$, the algorithm has depth at most $2\lfloor \log_s(n) \rfloor$ and runs in $O(n(1+\ell/s^2)/p + (s^2+\ell)\log_s(n))$ time assuming $p$ tensor core units. Equivalently, the algorithm performs $O(n/s^2)$ multiplications of square matrices of size $s$.
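To make the idea of a scan built from fixed-size matrix products concrete, the following is a minimal sequential sketch, not the authors' algorithm. It assumes a simplified block-recursive scheme: each block of s*s values is reshaped into an s x s matrix and scanned with three products against constant 0/1 matrices, and the block totals are then scanned recursively. The names `matmul`, `block_scan`, and `tcu_scan` are hypothetical, and `matmul` is a plain-Python stand-in for one tensor core operation.

```python
from itertools import accumulate  # used only to check the result


def matmul(X, Y):
    """Plain s x s matrix product; stands in for one tensor-core call (assumption)."""
    s = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def block_scan(block, s):
    """Inclusive prefix sums of s*s values using three s x s products."""
    A = [block[i * s:(i + 1) * s] for i in range(s)]                 # row-major reshape
    U = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]   # upper-triangular ones
    L = [[1 if j < i else 0 for j in range(s)] for i in range(s)]    # strictly lower ones
    J = [[1] * s for _ in range(s)]                                  # all ones
    row_scans = matmul(A, U)           # prefix sums within each row
    offsets = matmul(L, matmul(A, J))  # total of all earlier rows, repeated across the row
    return [row_scans[i][j] + offsets[i][j] for i in range(s) for j in range(s)]


def tcu_scan(x, s=4):
    """Inclusive scan of x by recursing over blocks of s*s elements (illustrative only)."""
    if s < 2:
        raise ValueError("s must be at least 2")
    n = len(x)
    if n == 0:
        return []
    m = s * s
    padded = list(x) + [0] * (-n % m)  # zero-pad to a multiple of s*s
    local = [block_scan(padded[i:i + m], s) for i in range(0, len(padded), m)]
    if len(local) == 1:
        return local[0][:n]
    totals = [blk[-1] for blk in local]  # last entry of a block scan is the block sum
    prefix = tcu_scan(totals, s)         # inclusive scan of the block totals
    out = list(local[0])
    for b in range(1, len(local)):
        out.extend(v + prefix[b - 1] for v in local[b])  # add the sum of earlier blocks
    return out[:n]


data = list(range(1, 41))
assert tcu_scan(data, s=3) == list(accumulate(data))
```

In this sketch each block costs three matrix products and the number of blocks shrinks by a factor of s*s per recursion level, so the total number of products is proportional to n/s^2. The sketch does not model the latency parameter or the p parallel units of the TCU model, and its recursion structure is not claimed to match the depth bound stated in the abstract.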
@article{arxiv.2411.17887,
title = {A Parallel Scan Algorithm in the Tensor Core Unit Model},
author = {Anastasios Zouzias and William F. McColl},
journal = {arXiv preprint arXiv:2411.17887},
year = {2024}
}
Comments
14 pages, published in 29th International European Conference on Parallel and Distributed Computing (EuroPar 2023)