English
Related papers

Related papers: PRAGMA: A Profiling-Reasoned Multi-Agent Framework…

200 papers

The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific…

Artificial Intelligence · Computer Science 2025-10-21 Juncheng Dong , Yang Yang , Tao Liu , Yang Wang , Feng Qi , Vahid Tarokh , Kaushik Rangadurai , Shuang Yang

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu , Anbang Wang , Xiuxiu Bai , Shuai Liu , Xiaoshe Dong

The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires…

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines…

Machine Learning · Computer Science 2026-03-12 Qitong Sun , Jun Han , Tianlin Li , Zhe Tang , Sheng Chen , Fei Yang , Aishan Liu , Xianglong Liu , Yang Liu

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Anjiang Wei , Tianran Sun , Yogesh Seenichamy , Hang Song , Anne Ouyang , Azalia Mirhoseini , Ke Wang , Alex Aiken

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews , Sam Witteveen

Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only…

Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-16 Nina Wiedemann , Quentin Leboutet , Michael Paulitsch , Diana Wofk , Benjamin Ummenhofer

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen , Jiace Zhu , Qi Fan , Yehan Ma , An Zou

Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low-level backend code. This challenge is exacerbated in the…

Software Engineering · Computer Science 2025-12-23 Muhammad Usman Tariq , Abhinav Jangda , Angelica Moreira , Madan Musuvathi , Tyler Sorensen

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba , Anne Ouyang , Xujie Si , Fan Long

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li , Keyu Man , Partha Kanuparthy , Hanning Chen , Wei Sun , Sreen Tallam , Chenguang Zhu , Kevin Zhu , Zhiyun Qian

Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but…

Performance · Computer Science 2026-04-28 Mohammad Zaeed , Tanzima Z. Islam , Vladimir Indic

Large Language Models (LLMs) demand substantial computational resources, resulting in high energy consumption on GPUs. To address this challenge, we focus on Coarse-Grained Reconfigurable Arrays (CGRAs) as an effective alternative that…

Hardware Architecture · Computer Science 2025-12-02 Takuto Ando , Yu Eto , Ayumu Takeuchi , Yasuhiko Nakashima

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-08 Gregory Bolet , Giorgis Georgakoudis , Harshitha Menon , Konstantinos Parasyris , Niranjan Hasabnis , Hayden Estes , Kirk W. Cameron , Gal Oren

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and…

Machine Learning · Computer Science 2025-11-18 Taras Sereda , Tom St. John , Burak Bartan , Natalie Serrino , Sachin Katti , Zain Asgar

High-Level Synthesis enables the rapid prototyping of hardware accelerators, by combining a high-level description of the functional behavior of a kernel with a set of micro-architecture optimizations as inputs. Such optimizations can be…

Hardware Architecture · Computer Science 2025-02-11 Stéphane Pouget , Louis-Noël Pouchet , Jason Cong

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This…

Hardware Architecture · Computer Science 2026-03-11 Jiayi Nie , Haoran Wu , Yao Lai , Zeyu Cao , Cheng Zhang , Binglei Lou , Erwei Wang , Jianyi Cheng , Timothy M. Jones , Robert Mullins , Rika Antonova , Yiren Zhao

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate…

Machine Learning · Computer Science 2025-02-18 Anne Ouyang , Simon Guo , Simran Arora , Alex L. Zhang , William Hu , Christopher Ré , Azalia Mirhoseini

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they…

Machine Learning · Computer Science 2026-05-25 Gabriele Oliaro , Yichao Fu , May Jiang , Owen Lu , Junli Wang , Zhihao Jia , Hao Zhang , Samyam Rajbhandari
‹ Prev 1 2 3 10 Next ›