Related papers: GPU Performance Portability needs Autotuning

A Few Fit Most: Improving Performance Portability of SGEMM on GPUs using Multi-Versioning

Hand-optimizing linear algebra kernels for different GPU devices and applications is complex and labor-intensive. Instead, many developers use automatic performance tuning (autotuning) to achieve high performance on a variety of devices.…

Programming Languages · Computer Science 2025-07-22 Robert Hochgraf , Sreepathi Pai

A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

Autotuning of performance-relevant source-code parameters allows to automatically tune applications without hard coding optimizations and thus helps with keeping the performance portable. In this paper, we introduce a benchmark set of ten…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-02 Filip Petrovič , David Střelák , Jana Hozzová , Jaroslav Oľha , Richard Trembecký , Siegfried Benkner , Jiří Filipovič

Software Autotuning for Sustainable Performance Portability

Scientific software applications are increasingly developed by large interdiscplinary teams operating on functional modules organized around a common software framework, which is capable of integrating new functional capabilities without…

Performance · Computer Science 2013-09-10 Azamat Mametjanov , Boyana Norris

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix…

Performance · Computer Science 2026-04-14 Kaixuan Zhang , Chutong Ding , Shiyou Qian , Luping Wang , Jian Cao , Guangtao Xue , Cheng Huang , Guodong Yang , Liping Zhang

Towards a Benchmarking Suite for Kernel Tuners

As computing system become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-17 Jacob O. Tørring , Ben van Werkhoven , Filip Petrovic , Floris-Jan Willemsen , Jirí Filipovic , Anne C. Elster

Machine Learning Based Auto-tuning for Enhanced OpenCL Performance Portability

Heterogeneous computing, which combines devices with different architectures, is rising in popularity, and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programing such…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-15 Thomas L. Falch , Anne C. Elster

The Anatomy of a Triton Attention Kernel

A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this…

Machine Learning · Computer Science 2025-11-18 Burkhard Ringlein , Jan van Lunteren , Radu Stoica , Thomas Parnell

Just-in-Time autotuning

Performance portability is a major concern on current architectures. One way to achieve it is by using autotuning. In this paper, we are presenting how we exten ded a just-in-time compilation infrastructure to introduce autotuning…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-13 Elian Morel , Camille Coti

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu , Anbang Wang , Xiuxiu Bai , Shuai Liu , Xiaoshe Dong

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application, and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-05 Richard Schoonhoven , Ben van Werkhoven , Kees Joost Batenburg

Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning

Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-15 Richard Schoonhoven , Bram Veenboer , Ben van Werkhoven , Kees Joost Batenburg

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parameter-value combinations. Example parameters include the OpenCL workgroup size,…

Performance · Computer Science 2017-05-15 Cedric Nugteren , Valeriu Codreanu

PortLLM: Personalizing Evolving Large Language Models with Training-Free and Portable Model Patches

As large language models (LLMs) increasingly shape the AI landscape, fine-tuning pretrained models has become more popular than in the pre-LLM era for achieving optimal performance in domain-specific tasks. However, pretrained LLMs such as…

Computation and Language · Computer Science 2025-04-01 Rana Muhammad Shahroz Khan , Pingzhi Li , Sukwon Yun , Zhenyu Wang , Shahriar Nirjon , Chau-Wai Wong , Tianlong Chen

Using hardware performance counters to speed up autotuning convergence on GPUs

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-20 Jiří Filipovič , Jana Hozzová , Amin Nezarat , Jaroslav Oľha , Filip Petrovič

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-08 Gregory Bolet , Giorgis Georgakoudis , Harshitha Menon , Konstantinos Parasyris , Niranjan Hasabnis , Hayden Estes , Kirk W. Cameron , Gal Oren

GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable

Modern large language foundation models (LLM) have now entered the daily lives of millions of users. We ask a natural question whether it is possible to customize LLM for every user or every task. From system and industrial economy…

Machine Learning · Computer Science 2025-04-11 Jianqiao Wangni

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but…

Machine Learning · Computer Science 2024-08-22 Elias Frantar , Roberto L. Castro , Jiale Chen , Torsten Hoefler , Dan Alistarh

Performance portability through machine learning guided kernel selection in SYCL libraries

Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware, however these techniques are typically focused on finding optimal kernel parameters for particular input sizes…

Performance · Computer Science 2020-09-01 John Lawson

APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration

Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the…

Machine Learning · Computer Science 2025-09-03 Shaobo Ma , Chao Fang , Haikuo Shao , Zhongfeng Wang

Automated Algorithm Design for Auto-Tuning Optimizers

Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. While auto-tuners traditionally rely on classical approaches…

Machine Learning · Computer Science 2026-04-01 Floris-Jan Willemsen , Niki van Stein , Ben van Werkhoven