Related papers: KernelFoundry: Hardware-aware evolutionary GPU ker…

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate…

Machine Learning · Computer Science 2025-02-18 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews, Sam Witteveen

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific…

Artificial Intelligence · Computer Science 2025-10-21 Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang

SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization

Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Any existing approach lacks a key characteristic that human performance engineers…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-29 Arya Tschand, Muhammad Awad, Ryan Swann, Kesavan Ramakrishnan, Jeffrey Ma, Keith Lowery, Ganesh Dasika, Vijay Janapa Reddi

MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-29 Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang

K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model

Optimizing GPU kernels is critical for efficient modern machine learning systems yet remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large…

Artificial Intelligence · Computer Science 2026-02-27 Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they…

Machine Learning · Computer Science 2026-05-25 Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical…

Machine Learning · Computer Science 2024-04-18 Khawir Mahmood, Jehandad Khan, Hammad Afzal

Kernel methods through the roof: handling billions of points efficiently

Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems, since naïve implementations scale poorly with data size. Recent advances have shown the benefits…

Machine Learning · Computer Science 2020-11-30 Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation…

Machine Learning · Computer Science 2026-01-21 Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Roman Levenstein, Kunming Ho, Haishan Zhu, Alec Hammond, Richard Li, Ajit Mathews, Kaustubh Gondkar, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix…

Performance · Computer Science 2026-04-14 Kaixuan Zhang, Chutong Ding, Shiyou Qian, Luping Wang, Jian Cao, Guangtao Xue, Cheng Huang, Guodong Yang, Liping Zhang

Performance portability through machine learning guided kernel selection in SYCL libraries

Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware, however these techniques are typically focused on finding optimal kernel parameters for particular input sizes…

Performance · Computer Science 2020-09-01 John Lawson

ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

GPU kernel generation by LLMs has recently experienced rapid development, leveraging test-time scaling and reinforcement learning techniques. However, a key challenge for kernel generation is the scarcity of high-quality data, as most…

Machine Learning · Computer Science 2025-10-10 Lingcheng Kong, Jiateng Wei, Hanzhang Shen, Huan Wang

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-05 Richard Schoonhoven, Ben van Werkhoven, Kees Joost Batenburg

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-25 Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization…

Machine Learning · Computer Science 2026-03-10 Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science 2026-01-26 Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long