Related papers: CudaForge: An Agent Framework with Hardware Feedba…

CUCo: An Agentic Framework for Compute and Communication Co-design

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-04 · Bodun Hu, Yoga Sri Varshan, Saurabh Agarwal, Aditya Akella

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive…

Machine Learning · Computer Science · 2026-03-02 · Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science · 2026-04-03 · Tara Saba, Anne Ouyang, Xujie Si, Fan Long

cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined…

Artificial Intelligence · Computer Science · 2025-12-24 · Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang

KForge: Program Synthesis for Diverse AI Hardware Accelerators

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and…

Machine Learning · Computer Science · 2025-11-18 · Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers…

Machine Learning · Computer Science · 2026-02-17 · Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis

EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for…

Machine Learning · Computer Science · 2025-10-07 · Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing…

Software Engineering · Computer Science · 2025-09-19 · Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha

Astra: A Multi-Agent System for GPU Kernel Performance Optimization

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-04 · Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science · 2025-06-12 · Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

Large language models (LLMs) are remarked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs as…

Hardware Architecture · Computer Science · 2025-01-15 · Guoliang He, Eiko Yoneki

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science · 2025-12-16 · Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, Zhiyun Qian

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science · 2026-01-26 · Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization…

Machine Learning · Computer Science · 2026-03-10 · Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-30 · Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show…

Multiagent Systems · Computer Science · 2026-03-04 · Shiyang Li, Zijian Zhang, Winson Chen, Yuebo Luo, Mingyi Hong, Caiwen Ding

Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations.…

Machine Learning · Computer Science · 2026-04-01 · Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation…

Machine Learning · Computer Science · 2026-01-21 · Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Roman Levenstein, Kunming Ho, Haishan Zhu, Alec Hammond, Richard Li, Ajit Mathews, Kaustubh Gondkar, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers…

Computation and Language · Computer Science · 2026-05-19 · Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

AgentForge: A Flexible Low-Code Platform for Reinforcement Learning Agent Design

Developing a reinforcement learning (RL) agent often involves identifying values for numerous parameters, covering the policy, reward function, environment, and agent-internal architecture. Since these parameters are interrelated in complex…

Machine Learning · Computer Science · 2025-04-03 · Francisco Erivaldo Fernandes Junior, Antti Oulasvirta