Related papers: GPU Kernel Optimization Beyond Full Builds: An LLM…

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews, Sam Witteveen

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization…

Machine Learning · Computer Science 2026-03-10 Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate…

Machine Learning · Computer Science 2025-02-18 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-08 Gregory Bolet, Giorgis Georgakoudis, Harshitha Menon, Konstantinos Parasyris, Niranjan Hasabnis, Hayden Estes, Kirk W. Cameron, Gal Oren

STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific…

Artificial Intelligence · Computer Science 2025-10-21 Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang

Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-05 Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Omniwise: Predicting GPU Kernels Performance with LLMs

In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful…

Machine Learning · Computer Science 2025-06-27 Zixian Wang, Cole Ramos, Muhammad A. Awad, Keith Lowery

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-26 Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-25 Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

LLM-Powered Code Analysis and Optimization for Gaussian Splatting Kernels

3D Gaussian splatting (3DGS) is a transformative technique with profound implications on novel view synthesis and real-time rendering. Given its importance, there have been many attempts to improve its performance. However, with the…

Hardware Architecture · Computer Science 2025-10-14 Yi Hu, Huiyang Zhou

KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-16 Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer

Kernel methods through the roof: handling billions of points efficiently

Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits…

Machine Learning · Computer Science 2020-11-30 Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

GPU Performance Portability needs Autotuning

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises…

Hardware Architecture · Computer Science 2025-07-18 Burkhard Ringlein, Thomas Parnell, Radu Stoica

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-05 Richard Schoonhoven, Ben van Werkhoven, Kees Joost Batenburg

KernelBand: Steering LLM-based Kernel Optimization via Hardware-Aware Multi-Armed Bandits

High-performance GPU kernels are critical for efficient LLM serving, yet their optimization remains a bottleneck requiring deep system expertise. While code LLMs show promise in generating functionally correct code, kernel optimization is…

Machine Learning · Computer Science 2026-02-12 Dezhi Ran, Shuxiao Xie, Mingfang Ji, Anmin Liu, Mengzhou Wu, Yuan Cao, Yuzhe Guo, Hao Yu, Linyi Li, Yitao Hu, Wei Yang, Tao Xie

MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-29 Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang