GPU Tensor Cores for fast Arithmetic Reductions

Distributed, Parallel, and Cluster Computing 2020-01-17 v1

Abstract

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5 \log_{m^2}{n}$ and its speedup is $S = \dfrac{4}{5} \log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation and preserves numerical precision, because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values before all being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks: small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4, 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
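The idea of expressing a sum as chained MMAs can be illustrated with a small CPU-side sketch. It is not the paper's GPU kernel: the function names (`mma`, `chained_mma_sum`, `model_speedup`), the use of all-ones matrices to collapse a tile, and the zero-padding of the last tile are assumptions made for illustration, and Python floats are 64-bit rather than the 32-bit accumulators the abstract describes. Each chain accumulates R tiles of m × m values into one accumulator matrix, then one further MMA collapses that accumulator into a single partial sum.

```python
import math


def mma(a, b, c):
    """One m x m matrix multiply-accumulate: returns a @ b + c."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)]
            for i in range(m)]


def chained_mma_sum(values, m=4, r=2):
    """Sum `values` with chains of r MMAs over m x m tiles (illustrative only)."""
    values = list(values)
    ones = [[1.0] * m for _ in range(m)]
    zeros = [[0.0] * m for _ in range(m)]
    tile = m * m
    partials = []
    for start in range(0, len(values), tile * r):
        acc = zeros
        for step in range(r):
            lo = start + step * tile
            chunk = values[lo:lo + tile]
            chunk = chunk + [0.0] * (tile - len(chunk))  # zero-pad a short tile
            b = [chunk[i * m:(i + 1) * m] for i in range(m)]
            # ones @ b places the column sums of b in every row; the
            # accumulator carries them across the chain.
            acc = mma(ones, b, acc)
        # acc @ ones adds the column sums, so every entry holds the chain total.
        partials.append(mma(acc, ones, zeros)[0][0])
    # Partial results of all chains are combined into the final value.
    return sum(partials)


def model_speedup(m):
    """Speedup S = (4/5) * log2(m^2) from the abstract's cost model."""
    return (4 / 5) * math.log2(m * m)
```

Calling `chained_mma_sum(range(1, 101), m=4, r=2)` reduces 100 values through four chains of two MMAs each. Under the abstract's model, `model_speedup(16)` evaluates the theoretical speedup for 16 × 16 tiles, a tile shape supported by CUDA's WMMA API; the measured speedup reported in the abstract is lower than this model value.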

Cite

@article{arxiv.2001.05585,
  title  = {GPU Tensor Cores for fast Arithmetic Reductions},
  author = {Cristóbal A. Navarro and Roberto Carrasco and Ricardo J. Barrientos and Javier A. Riquelme and Raimundo Vega},
  journal= {arXiv preprint arXiv:2001.05585},
  year   = {2020}
}

Comments

14 pages, 11 figures