Related papers: Xe-Forge: Multi-Stage LLM-Powered Kernel Optimizat…

TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, Zhiyun Qian

AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs

Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific…

Machine Learning · Computer Science 2025-07-09 Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun

The Anatomy of a Triton Attention Kernel

A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this…

Machine Learning · Computer Science 2025-11-18 Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell

Fine-Tuning GPT-5 for GPU Kernel Generation

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-12 Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces…

Computation and Language · Computer Science 2025-03-27 Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Opperman, Jacky Deng

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal…

Machine Learning · Computer Science 2026-05-19 Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

Strategies for Optimizing End-to-End Artificial Intelligence Pipelines on Intel Xeon Processors

End-to-end (E2E) artificial intelligence (AI) pipelines are composed of several stages including data preprocessing, data ingestion, defining and training the model, hyperparameter optimization, deployment, inference, postprocessing,…

Machine Learning · Computer Science 2022-11-02 Meena Arunachalam, Vrushabh Sanghavi, Yi A Yao, Yi A Zhou, Lifeng A Wang, Zongru Wen, Niroop Ammbashankar, Ning W Wang, Fahim Mohammad

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews, Sam Witteveen

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert…

Artificial Intelligence · Computer Science 2026-05-28 Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

KForge: Program Synthesis for Diverse AI Hardware Accelerators

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and…

Machine Learning · Computer Science 2025-11-18 Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels,…

Computation and Language · Computer Science 2026-05-28 Siqi Guo, Ming Lin, Tianbao Yang

CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code…

Machine Learning · Computer Science 2025-11-06 Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

HDLFORGE: A Two-Stage Multi-Agent Framework for Efficient Verilog Code Generation with Adaptive Model Escalation

We present HDLFORGE, a two-stage multi-agent framework for automated Verilog generation that optimizes the trade-off between generation speed and accuracy. The system uses a compact coder with a medium-sized LLM by default (Stage A) and…

Hardware Architecture · Computer Science 2026-03-06 Armin Abdollahi, Saeid Shokoufa, Negin Ashrafi, Mehdi Kamal, Massoud Pedram

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to…

Computation and Language · Computer Science 2025-08-01 Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie

Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained…

Artificial Intelligence · Computer Science 2026-01-27 Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-27 Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia

An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive property exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-18 Ruijia Yang, Zeyi Wen