Performance - Scifaro

DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt…

Performance · Computer Science 2026-05-07 Kathiravan Palaniappan

SPEC CPU: The Next Generation

The march toward developing relevant and robust CPU benchmarks continues with the introduction of SPEC CPU 2026, the next generation suite for measuring processor performance. This paper details the methodology behind its creation,…

Performance · Computer Science 2026-05-05 Mahesh Madhav, Allen Lee, Andres Mejia, Branden Moore, Charan Soppadandi, Chris Cambly, Christoph Müllner, Daniel Bowers, David Reiner, Denis Bakhvalov, Di Zhao, Duane Voth, Feng Xue, Frédérique Silber-Chaussumier, James Bucek, James Southern, Jiangning Liu, Jim Himer, John Henning, Kevin Smith, Kristen Yang, Kunal Kashyap, Mason Guy, Mat Colgrove, Michael Berg, Prasad Battini, Prasad Joshi, Rohit Prasad, Shayantika Bhattacharya, Sriyash Caculo, Stefan Reimbold, Sundar Iyengar, Van Smith, Zarko Todorovski

Priority Scheduling in the M/G/1 with Preemption Overhead

Virtually all practical settings where preemptive scheduling is employed are susceptible to preemption overhead, and accounting for these overheads is necessary to make informed scheduling design decisions. However, preemption overhead is…

Performance · Computer Science 2026-05-05 Shefali Ramakrishna, Edwin Peng, Ziv Scully

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper…

Performance · Computer Science 2026-05-05 Abdurrahman Javat, Allan Kazakov

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace driver. As a result, the translation from…

Performance · Computer Science 2026-04-30 Yuang Yan, Ian Karlin, Ryan Grant

PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to…

Performance · Computer Science 2026-04-29 Kaixuan Zhang, Yunfan Cui, Shuhao Zhang, Chutong Ding, Shiyou Qian, Luping Wang, Jian Cao, Guangtao Xue, Cheng Huang, Guodong Yang, Liping Zhang

denet, A lightweight command-line tool for process monitoring in benchmarking and beyond

Summary: denet is a lightweight process monitoring tool providing real-time resource profiling of running processes. It reports CPU, memory, disk I/O, network activity, and thread usage, including recursive child monitoring, with adaptive…

Performance · Computer Science 2026-04-29 Ben Carrillo, Izaskun Mallona

Energy-Aware LLMs: A step towards sustainable AI for downstream applications

Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks, sparking an innovation wave that has led to new applications and services, and significantly enhanced solution schemes. Despite all…

Performance · Computer Science 2026-04-29 Nguyen Phuc Tran, Brigitte Jaumard, Oscar Delgado

Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but…

Performance · Computer Science 2026-04-28 Mohammad Zaeed, Tanzima Z. Islam, Vladimir Indic

COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify minimal changes for a near-miss configuration to meet a…

Performance · Computer Science 2026-04-28 Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Mohammad Zaeed, Tanzima Z. Islam

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware…

Performance · Computer Science 2026-04-21 Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference…

Performance · Computer Science 2026-04-20 Jevin Jiang, Ying Chen, Blake A. Hechtman, Fenghui Zhang, Yarong Mu

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing…

Performance · Computer Science 2026-04-15 Qisheng Su, Shiting Huang, Zhen Fang, Ziyan Chen, Zehui Chen, Feng Zhao

Architectural Trade-offs in the Energy-Efficient Era: A Comparative Study of power-capping NVIDIA H100 and H200

Modern NVIDIA GPUs like the H100 (HBM2e) and H200 (HBM3e) share similar compute characteristics but differ significantly in memory interface technology and bandwidth. By isolating memory bandwidth as a key variable, the power distribution…

Performance · Computer Science 2026-04-14 Aditya Ujeniya, Jan Eitzinger, Georg Hager, Gerhard Wellein

WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix…

Performance · Computer Science 2026-04-14 Kaixuan Zhang, Chutong Ding, Shiyou Qian, Luping Wang, Jian Cao, Guangtao Xue, Cheng Huang, Guodong Yang, Liping Zhang

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, while the system preserves long-term context and generates…

Performance · Computer Science 2026-04-14 Tuowei Wang, He Zhou, Chengru Song, Qiushi Li, Ju Ren

Training Transformers in Cosine Coefficient Space

Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores $K$ out of $mn$ two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every…

Performance · Computer Science 2026-04-10 Mohamed Amine Bergach

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of…

Performance · Computer Science 2026-04-09 Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

Memshare: Memory Sharing for Multicore Computation in R with an Application to Feature Selection by Mutual Information using PDE

We present memshare (published as a CRAN package at https://CRAN.R-project.org/package=memshare), a package that enables shared memory multicore computation in R by allocating buffers in C++ shared memory…

Performance · Computer Science 2026-04-08 Michael C. Thrun, Julian Märte

Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search

An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same…

Performance · Computer Science 2026-04-07 Mohamed Amine Bergach