Analyzing GPU Tensor Core Potential for Fast Reductions

Roberto Carrasco; Raimundo Vega; Cristóbal A. Navarro

doi:10.29007/zlmg

Distributed, Parallel, and Cluster Computing 2019-03-12 v1

Abstract

The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, the parallel arithmetic reduction problem. We propose a new GPU tensor-core based algorithm and analyze its potential performance benefits in comparison to a traditional GPU-based one. The proposed method encodes the reduction of $n$ numbers as a set of $m\times m$ MMA tensor-core operations (for Nvidia's Volta architecture, $m=16$) and takes advantage of the fact that each MMA operation takes just one GPU cycle. When analyzing the cost under a simplified GPU computing model, the result is that the new algorithm reduces a problem of $n$ numbers in $T(n) = 5\log_{m^2}(n)$ steps, with a speedup of $S = \frac{4}{5}\log_2(m^2)$.
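To make the encoding concrete, the following is a minimal CPU-side sketch in plain Python of one way a reduction can be phrased as $m\times m$ MMA operations. It is an illustration under stated assumptions, not the authors' implementation. The function names (`mma`, `reduce_tile`, `tensor_reduce`, `steps_tensor`, `steps_classic`, `speedup`) are hypothetical, and the use of all-ones matrices is one possible encoding chosen here. The classic cost of $4\log_2(n)$ steps is also an assumption: it is the value implied by the stated $T(n)$ and $S$ if $S$ is the ratio of the two step counts.

```python
import math

M = 16  # tile side; the abstract gives m = 16 for Volta


def mma(a, b, c):
    """Return a*b + c for m x m matrices stored as lists of rows.

    Software stand-in for a single tensor-core MMA operation.
    """
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) + c[i][j]
             for j in range(m)]
            for i in range(m)]


def reduce_tile(values, m=M):
    """Sum up to m*m numbers using two MMAs against all-ones matrices.

    Assumed encoding for illustration; the paper may arrange this differently.
    """
    padded = list(values) + [0] * (m * m - len(values))
    tile = [padded[r * m:(r + 1) * m] for r in range(m)]
    ones = [[1] * m for _ in range(m)]
    zeros = [[0] * m for _ in range(m)]
    col_sums = mma(ones, tile, zeros)   # every row holds the column sums
    total = mma(col_sums, ones, zeros)  # every entry holds the full sum
    return total[0][0]


def tensor_reduce(values, m=M):
    """Reduce chunks of m*m values level by level; return (sum, levels)."""
    values = list(values)
    if not values:
        return 0, 0
    levels = 0
    while len(values) > 1:
        values = [reduce_tile(values[i:i + m * m], m)
                  for i in range(0, len(values), m * m)]
        levels += 1
    return values[0], levels


def steps_tensor(n, m=M):
    """T(n) = 5 * log_{m^2}(n), as stated in the abstract."""
    return 5 * math.log(n, m * m)


def steps_classic(n):
    """Assumed classic cost of 4 * log2(n) steps, inferred from S and T(n)."""
    return 4 * math.log2(n)


def speedup(m=M):
    """S = (4/5) * log2(m^2), as stated in the abstract."""
    return (4 / 5) * math.log2(m * m)
```

Each level shrinks the input by a factor of $m^2$, which is where the $\log_{m^2}(n)$ term comes from. For $m=16$ the stated speedup evaluates to $\frac{4}{5}\log_2(256) = 6.4$.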

Keywords

gpu computing, parallel algorithm, fpga accelerator

Cite

@article{arxiv.1903.03640,
  title  = {Analyzing GPU Tensor Core Potential for Fast Reductions},
  author = {Roberto Carrasco and Raimundo Vega and Cristóbal A. Navarro},
  journal= {arXiv preprint arXiv:1903.03640},
  year   = {2019}
}

Comments

This paper was presented at the SCCC 2018 Conference, November 5
