Related papers: cuPilot: A Strategy-Coordinated Multi-agent Framew…

EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for…

Machine Learning · Computer Science 2025-10-07 Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code…

Machine Learning · Computer Science 2025-11-06 Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

CUCo: An Agentic Framework for Compute and Communication Co-design

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-04 Bodun Hu, Yoga Sri Varshan, Saurabh Agarwal, Aditya Akella

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive…

Machine Learning · Computer Science 2026-03-02 Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou

OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization

Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although large language models can synthesize…

Machine Learning · Computer Science 2026-02-16 Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, Ali Jannesari

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science 2026-01-26 Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the…

Machine Learning · Computer Science 2025-12-15 Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization

Automatic code optimization remains a difficult challenge, particularly for complex loop nests on modern hardware. This paper investigates a novel approach to code optimization where Large Language Models (LLMs) guide the process through a…

Programming Languages · Computer Science 2025-12-30 Massinissa Merouani, Islem Kara Bernou, Riyadh Baghdadi

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a controlled operator-level study of CUDA…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-30 Huriyeh Babak, Melanie Schaller

Tutoring LLM into a Better CUDA Optimizer

Recent leaps in large language models (LLMs) caused a revolution in programming tools (like GitHub Copilot) that can help with code generation, debugging, and even performance optimization. In this paper, we focus on the capabilities of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Matyáš Brabec, Jiří Klepl, Michal Töpfer, Martin Kruliš

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for…

Machine Learning · Computer Science 2026-05-07 Xing Ma, Yangjie Zhou, Wu Sun, Zihan Liu, Jingwen Leng, Yun Lin, Shixuan Sun, Minyi Guo, Jin Song Dong

Astra: A Multi-Agent System for GPU Kernel Performance Optimization

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization…

Machine Learning · Computer Science 2026-03-10 Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers…

Machine Learning · Computer Science 2026-02-17 Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing…

Software Engineering · Computer Science 2025-09-19 Robert Tjarko Lange, Qi Sun, Aaditya Prasad, Maxence Faldor, Yujin Tang, David Ha

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine…

Artificial Intelligence · Computer Science 2026-05-27 Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated…

Artificial Intelligence · Computer Science 2026-03-20 Yuyang Liu

Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming

GitHub Copilot, an extension for the Visual Studio Code development environment powered by the large-scale language model Codex, makes automatic program synthesis available for software developers. This model has been extensively studied in…

Software Engineering · Computer Science 2021-11-16 Dominik Sobania, Martin Briesch, Franz Rothlauf