Related papers: KernelBlaster: Continual Cross-Task CUDA Optimizat…

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive…

Machine Learning · Computer Science 2026-03-02 Weinan Dai , Hanlin Wu , Qiying Yu , Huan-ang Gao , Jiahao Li , Chengquan Jiang , Weiqiang Lou , Yufan Song , Hongli Yu , Jiaze Chen , Wei-Ying Ma , Ya-Qin Zhang , Jingjing Liu , Mingxuan Wang , Xin Liu , Hao Zhou

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing…

Software Engineering · Computer Science 2025-09-19 Robert Tjarko Lange , Qi Sun , Aaditya Prasad , Maxence Faldor , Yujin Tang , David Ha

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun , Jun Han , Tianlin Li , Zhe Tang , Sheng Chen , Fei Yang , Aishan Liu , Xianglong Liu , Yang Liu

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science 2026-01-26 Qiuyi Qu , Yicheng Sui , Yufei Sun , Rui Chen , Xiaofei Zhang , Yuzhi Zhang , Haofeng Wang , Ge Lan

Astra: A Multi-Agent System for GPU Kernel Performance Optimization

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Anjiang Wei , Tianran Sun , Yogesh Seenichamy , Hang Song , Anne Ouyang , Azalia Mirhoseini , Ke Wang , Alex Aiken

STARK: Strategic Team of Agents for Refining Kernels

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific…

Artificial Intelligence · Computer Science 2025-10-21 Juncheng Dong , Yang Yang , Tao Liu , Yang Wang , Feng Qi , Vahid Tarokh , Kaushik Rangadurai , Shuang Yang

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success…

Artificial Intelligence · Computer Science 2026-02-04 Xiaoya Li , Xiaofei Sun , Albert Wang , Jiwei Li , Chris Shum

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show…

Multiagent Systems · Computer Science 2026-03-04 Shiyang Li , Zijian Zhang , Winson Chen , Yuebo Luo , Mingyi Hong , Caiwen Ding

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization…

Machine Learning · Computer Science 2026-03-10 Yuxuan Han , Meng-Hao Guo , Zhengning Liu , Wenguang Chen , Shi-Min Hu

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

Large language models (LLMs) are remarked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs as…

Hardware Architecture · Computer Science 2025-01-15 Guoliang He , Eiko Yoneki

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen , Jiace Zhu , Qi Fan , Yehan Ma , An Zou

Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations.…

Machine Learning · Computer Science 2026-04-01 Siva Kumar Sastry Hari , Vignesh Balaji , Sana Damani , Qijing Huang , Christos Kozyrakis

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex…

Artificial Intelligence · Computer Science 2026-05-20 Yin Xiaolong , Liu Yu , Shen Jiahang , Lu Xingyu , Ni Jingzhe , Fan Fengxiao , Sang Fan

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the…

Machine Learning · Computer Science 2025-12-15 Songqiao Su , Xiaofei Sun , Xiaoya Li , Albert Wang , Jiwei Li , Chris Shum

EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for…

Machine Learning · Computer Science 2025-10-07 Ping Guo , Chenyu Zhu , Siyuan Chen , Fei Liu , Xi Lin , Zhichao Lu , Qingfu Zhang

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews , Sam Witteveen

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code…

Machine Learning · Computer Science 2025-11-06 Zijian Zhang , Rong Wang , Shiyang Li , Yuebo Luo , Mingyi Hong , Caiwen Ding

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for…

Machine Learning · Computer Science 2026-05-07 Xing Ma , Yangjie Zhou , Wu Sun , Zihan Liu , Jingwen Leng , Yun Lin , Shixuan Sun , Minyi Guo , Jin Song Dong

Agentic Code Optimization via Compiler-LLM Cooperation

Generating performant executables from high level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at…

Programming Languages · Computer Science 2026-04-07 Benjamin Mikek , Danylo Vashchilenko , Bryan Lu , Panpan Xu

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This…

Hardware Architecture · Computer Science 2026-03-11 Jiayi Nie , Haoran Wu , Yao Lai , Zeyu Cao , Cheng Zhang , Binglei Lou , Erwei Wang , Jianyi Cheng , Timothy M. Jones , Robert Mullins , Rika Antonova , Yiren Zhao