Related papers: PRAGMA: A Profiling-Reasoned Multi-Agent Framework…

STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific…

Artificial Intelligence · Computer Science 2025-10-21 Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

Towards Automated Kernel Generation in the Era of LLMs

The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires…

Machine Learning · Computer Science 2026-01-27 Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Astra: A Multi-Agent System for GPU Kernel Performance Optimization

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews, Sam Witteveen

AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving

Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only…

Machine Learning · Computer Science 2026-01-13 Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, Qin Zhou, Xumeng Chen, Ilya Sherstyuk, Haorui Li, Rishi Thakkar, Ben Hamm, Yuanzhe Li, Xue Huang, Wenpeng Wu, Anish Shanbhag, Harry Kim, Chuan Chen, Junjie Lai

KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-16 Nina Wiedemann, Quentin Leboutet, Michael Paulitsch, Diana Wofk, Benjamin Ummenhofer

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low-level backend code. This challenge is exacerbated in the…

Software Engineering · Computer Science 2025-12-23 Muhammad Usman Tariq, Abhinav Jangda, Angelica Moreira, Madan Musuvathi, Tyler Sorensen

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, Zhiyun Qian

Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but…

Performance · Computer Science 2026-04-28 Mohammad Zaeed, Tanzima Z. Islam, Vladimir Indic

Efficient Kernel Mapping and Comprehensive System Evaluation of LLM Acceleration on a CGLA

Large Language Models (LLMs) demand substantial computational resources, resulting in high energy consumption on GPUs. To address this challenge, we focus on Coarse-Grained Reconfigurable Arrays (CGRAs) as an effective alternative that…

Hardware Architecture · Computer Science 2025-12-02 Takuto Ando, Yu Eto, Ayumu Takeuchi, Yasuhiko Nakashima

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-08 Gregory Bolet, Giorgis Georgakoudis, Harshitha Menon, Konstantinos Parasyris, Niranjan Hasabnis, Hayden Estes, Kirk W. Cameron, Gal Oren

KForge: Program Synthesis for Diverse AI Hardware Accelerators

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and…

Machine Learning · Computer Science 2025-11-18 Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear Programming Approach

High-Level Synthesis enables the rapid prototyping of hardware accelerators, by combining a high-level description of the functional behavior of a kernel with a set of micro-architecture optimizations as inputs. Such optimizations can be…

Hardware Architecture · Computer Science 2025-02-11 Stéphane Pouget, Louis-Noël Pouchet, Jason Cong

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This…

Hardware Architecture · Computer Science 2026-03-11 Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate…

Machine Learning · Computer Science 2025-02-18 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they…

Machine Learning · Computer Science 2026-05-25 Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari