GPU Performance Portability needs Autotuning

Subjects: Hardware Architecture, Artificial Intelligence, Programming Languages
Version: v3 (2025-07-18)

Abstract

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
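The core idea of kernel parameter autotuning can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `blocked_sum` kernel, its `block_size` and `unroll` parameters, and the `autotune` helper are all hypothetical names chosen for illustration, and a real system would JIT-compile and benchmark GPU kernels rather than time a pure-Python function. The sketch only shows the general pattern of enumerating a configuration space, timing each candidate on the target platform, and keeping the fastest.

```python
import itertools
import time


def blocked_sum(data, block_size, unroll):
    """Toy stand-in for a kernel (hypothetical): sums a list in chunks.

    block_size and unroll play the role of tunable kernel parameters.
    """
    total = 0.0
    step = block_size * unroll
    for start in range(0, len(data), step):
        total += sum(data[start:start + step])
    return total


def _time_once(kernel, args, config):
    """Run the kernel once with the given configuration and return seconds."""
    start = time.perf_counter()
    kernel(*args, **config)
    return time.perf_counter() - start


def autotune(kernel, args, search_space, repeats=3):
    """Exhaustively time every configuration and return the fastest one."""
    names = list(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        # Take the minimum over a few repeats to reduce timing noise.
        elapsed = min(_time_once(kernel, args, config) for _ in range(repeats))
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


data = [float(i) for i in range(10_000)]
space = {"block_size": [16, 64, 256], "unroll": [1, 2, 4]}
best_config, best_time = autotune(blocked_sum, (data,), space)
print("fastest configuration on this machine:", best_config)
```

Because the winning configuration is chosen by measurement on whatever machine runs the search, the same kernel source can end up with different parameters on different hardware, which is the portability argument the abstract makes.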

Cite

@article{arxiv.2505.03780,
  title   = {GPU Performance Portability needs Autotuning},
  author  = {Burkhard Ringlein and Thomas Parnell and Radu Stoica},
  journal = {arXiv preprint arXiv:2505.03780},
  year    = {2025}
}

Comments

Revision after reviewers' feedback, broadening the autotuning study.