Related papers: Counting Without Running: Evaluating LLMs' Reasoni…

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-08 Gregory Bolet, Giorgis Georgakoudis, Harshitha Menon, Konstantinos Parasyris, Niranjan Hasabnis, Hayden Estes, Kirk W. Cameron, Gal Oren

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate…

Machine Learning · Computer Science 2025-02-18 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science 2024-06-18 Yuqing Wang, Yun Zhao

RPU -- A Reasoning Processing Unit

Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory bandwidth bound workloads. This…

Hardware Architecture · Computer Science 2026-02-25 Matthew Adiletta, Gu-Yeon Wei, David Brooks

LLMPerf: GPU Performance Modeling meets Large Language Models

Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language…

Performance · Computer Science 2025-03-17 Khoi N. M. Nguyen, Hoang Duy Nguyen Do, Huyen Thao Le, Thanh Tuan Dao

ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions

Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code…

Software Engineering · Computer Science 2025-03-07 Julian Aron Prenner, Romain Robbes

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU Kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking the more general and challenging task of…

Machine Learning · Computer Science 2026-03-04 Jiace Zhu, Wentao Chen, Qi Fan, Zhixing Ren, Junying Wu, Xing Zhe Chai, Chotiwit Rungrueangwutthinon, Yehan Ma, An Zou

Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the…

Computation and Language · Computer Science 2025-10-28 Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu

FlipFlop: A Static Analysis-based Energy Optimization Framework for GPU Kernels

Artificial Intelligence (AI) applications, such as Large Language Models, are primarily driven and executed by Graphics Processing Units (GPUs). These GPU programs (kernels) consume substantial amounts of energy, yet software developers…

Software Engineering · Computer Science 2026-01-21 Saurabhsingh Rajput, Alexander Brandt, Vadim Elisseev, Tushar Sharma

An Empirical Study of Reasoning Steps in Thinking Code LLMs

Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these…

Artificial Intelligence · Computer Science 2025-11-11 Haoran Xue, Gias Uddin, Song Wang

Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping

Mapping parallel threads onto non-box-shaped domains is a known challenge in GPU computing; efficient mapping prevents performance penalties from unnecessary resource allocation. Currently, achieving this requires significant analytical…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Jose Maureira, Cristóbal A. Navarro, Hector Ferrada, Luis Veas-Castillo

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

CodeMind: Evaluating Large Language Models for Code Reasoning

Large Language Models (LLMs) have been widely used to automate programming tasks. Their capabilities have been evaluated by assessing the quality of generated code through tests or proofs. The extent to which they can reason about code is a…

Software Engineering · Computer Science 2026-04-08 Changshu Liu, Yang Chen, Reyhaneh Jabbarvand

LookupFFN: Making Transformers Compute-lite for CPU inference

While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons including ease of workflow, security and cost have led to efforts investigating whether CPUs may be viable for inference…

Machine Learning · Computer Science 2024-03-13 Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh

ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real industrial workflows. To address this…

Artificial Intelligence · Computer Science 2026-02-03 Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, Yufei Ding

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

ProBench: Benchmarking Large Language Models in Competitive Programming

With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are gradually inadequate to assess the…

Computation and Language · Computer Science 2025-03-03 Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong

ResBench: Benchmarking LLM-Generated FPGA Designs with Resource Awareness

Field-Programmable Gate Arrays (FPGAs) are widely used in modern hardware design, yet writing Hardware Description Language (HDL) code for FPGA implementation remains a complex and time-consuming task. Large Language Models (LLMs) have…

Hardware Architecture · Computer Science 2025-03-25 Ce Guo, Tong Zhao