English

Task-Based Tensor Computations on Modern GPUs

Programming Languages 2025-04-10 v1

Abstract

Domain-specific, fixed-function units are becoming increasingly common in modern processors. As the computational demands of applications evolve, the capabilities and programming interfaces of these fixed-function units continue to change. NVIDIA's Hopper GPU architecture contains multiple fixed-function units per compute unit, including an asynchronous data movement unit (TMA) and an asynchronous matrix multiplication unit (Tensor Core). Efficiently utilizing these units requires a fundamentally different programming style than previous architectures; programmers must now develop warp-specialized kernels that orchestrate producer-consumer pipelines between the asynchronous units. To manage the complexity of programming these new architectures, we introduce Cypress, a task-based programming model with sequential semantics. Cypress programs are a set of designated functions called \emph{tasks} that operate on \emph{tensors} and are free of communication and synchronization. Cypress programs are bound to the target machine through a \emph{mapping} specification that describes where tasks should run and in which memories tensors should be materialized. We present a compiler architecture that lowers Cypress programs into CUDA programs that perform competitively with expert-written codes. Cypress achieves 0.88x-1.06x the performance of cuBLAS on GEMM, and between 0.80x-0.98x the performance of the currently best-known Flash Attention implementation while eliminating all aspects of explicit data movement and asynchronous computation from application code.

Keywords

Cite

@article{arxiv.2504.07004,
  title  = {Task-Based Tensor Computations on Modern GPUs},
  author = {Rohan Yadav and Michael Garland and Alex Aiken and Michael Bauer},
  journal= {arXiv preprint arXiv:2504.07004},
  year   = {2025}
}
R2 v1 2026-06-28T22:52:31.662Z