
Related papers: Xe-Forge: Multi-Stage LLM-Powered Kernel Optimizat…

200 papers

High-performance GPU kernel optimization remains a critical yet labor-intensive task in modern machine learning workloads. Although Triton, a domain-specific language for GPU programming, enables developers to write efficient kernels with…

Software Engineering · Computer Science 2025-12-16 Haonan Li, Keyu Man, Partha Kanuparthy, Hanning Chen, Wei Sun, Sreen Tallam, Chenguang Zhu, Kevin Zhu, Zhiyun Qian

Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific…

Machine Learning · Computer Science 2025-07-09 Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun

A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this…

Machine Learning · Computer Science 2025-11-18 Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-12 Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces…

Computation and Language · Computer Science 2025-03-27 Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Opperman, Jacky Deng

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal…

Machine Learning · Computer Science 2026-05-19 Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

End-to-end (E2E) artificial intelligence (AI) pipelines are composed of several stages including data preprocessing, data ingestion, defining and training the model, hyperparameter optimization, deployment, inference, postprocessing,…

Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU…

Machine Learning · Computer Science 2025-08-25 Martin Andrews, Sam Witteveen

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert…

Artificial Intelligence · Computer Science 2026-05-28 Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

GPU kernels are critical for ML performance but difficult to optimize across diverse accelerators. We present KForge, a platform-agnostic framework built on two collaborative LLM-based agents: a generation agent that produces and…

Machine Learning · Computer Science 2025-11-18 Taras Sereda, Tom St. John, Burak Bartan, Natalie Serrino, Sachin Katti, Zain Asgar

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels,…

Computation and Language · Computer Science 2026-05-28 Siqi Guo, Ming Lin, Tianbao Yang

Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code…

Machine Learning · Computer Science 2025-11-06 Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding

We present HDLFORGE, a two-stage multi-agent framework for automated Verilog generation that optimizes the trade-off between generation speed and accuracy. The system uses a compact coder with a medium-sized LLM by default (Stage A) and…

Hardware Architecture · Computer Science 2026-03-06 Armin Abdollahi, Saeid Shokoufa, Negin Ashrafi, Mehdi Kamal, Massoud Pedram

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to…

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of the latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained…

Artificial Intelligence · Computer Science 2026-01-27 Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy…

Machine Learning · Computer Science 2026-04-03 Tara Saba, Anne Ouyang, Xujie Si, Fan Long

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-27 Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory-intensive property exceeds the capabilities of most GPUs. To address this challenge and democratize LLM fine-tuning, we present SlideFormer,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-18 Ruijia Yang, Zeyi Wen