English
Related papers

Related papers: AutoKernel: Autonomous GPU Kernel Optimization via…

200 papers

Recent years have witnessed phenomenal growth in the application, and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-05 Richard Schoonhoven , Ben van Werkhoven , Kees Joost Batenburg

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers…

Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific…

Machine Learning · Computer Science 2025-07-09 Shangzhan Li , Zefan Wang , Ye He , Yuxuan Li , Qi Shi , Jianling Li , Yonggang Hu , Wanxiang Che , Xu Han , Zhiyuan Liu , Maosong Sun

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li , Keyu Man , Partha Kanuparthy , Hanning Chen , Wei Sun , Sreen Tallam , Chenguang Zhu , Kevin Zhu , Zhiyun Qian

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba , Anne Ouyang , Xujie Si , Fan Long

Autotuning of performance-relevant source-code parameters allows to automatically tune applications without hard coding optimizations and thus helps with keeping the performance portable. In this paper, we introduce a benchmark set of ten…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-02 Filip Petrovič , David Střelák , Jana Hozzová , Jaroslav Oľha , Richard Trembecký , Siegfried Benkner , Jiří Filipovič

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-08 Dominik Ernst , Markus Holzer , Georg Hager , Matthias Knorr , Gerhard Wellein

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-30 Robert V. Lim , Boyana Norris , Allen D. Malony

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to…

Computation and Language · Computer Science 2025-08-01 Jianghui Wang , Vinay Joshi , Saptarshi Majumder , Xu Chao , Bin Ding , Ziqiong Liu , Pratik Prabhanjan Brahma , Dong Li , Zicheng Liu , Emad Barsoum

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu , Anbang Wang , Xiuxiu Bai , Shuai Liu , Xiaoshe Dong

Automated tuning of compute kernels is a popular area of research, mainly focused on finding optimal kernel parameters for a problem with fixed input sizes. This approach is good for deploying machine learning models, where the network…

Machine Learning · Computer Science 2020-03-17 John Lawson

Modern computing systems are increasingly more complex, with their multicore CPUs and GPUs accelerators changing yearly, if not more often. It thus has become very challenging to write programs that efficiently use the associated complex…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-28 Jacob O. Tørring , Anne C. Elster

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science 2026-01-26 Qiuyi Qu , Yicheng Sui , Yufei Sun , Rui Chen , Xiaofei Zhang , Yuzhi Zhang , Haofeng Wang , Ge Lan

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews , Sam Witteveen

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun , Jun Han , Tianlin Li , Zhe Tang , Sheng Chen , Fei Yang , Aishan Liu , Xianglong Liu , Yang Liu

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix…

Performance · Computer Science 2026-04-14 Kaixuan Zhang , Chutong Ding , Shiyou Qian , Luping Wang , Jian Cao , Guangtao Xue , Cheng Huang , Guodong Yang , Liping Zhang

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Anjiang Wei , Tianran Sun , Yogesh Seenichamy , Hang Song , Anne Ouyang , Azalia Mirhoseini , Ke Wang , Alex Aiken

Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-15 Richard Schoonhoven , Bram Veenboer , Ben van Werkhoven , Kees Joost Batenburg

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific…

Multiagent Systems · Computer Science 2026-05-15 Kirill Nagaitsev , Luka Grbcic , Samuel Williams , Costin Iancu

As computing system become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-17 Jacob O. Tørring , Ben van Werkhoven , Filip Petrovic , Floris-Jan Willemsen , Jirí Filipovic , Anne C. Elster
‹ Prev 1 2 3 10 Next ›