Computer Science

Demystifying VEINS: A Reality Check Against Living Lab Experiments

Safety applications in vehicle-to-everything communications and Cooperative Intelligent Transport Systems rely on reliable and timely message exchange, which in turn depends on accurate modeling of wireless signal propagation. Simulation…

Performance · Computer Science 2026-05-29 Antonio Solida, Giovanni Gambigliani Zoccoli, Gaetano Orazio Cauchi, Filip Valgimigli, Salvatore Iandolo, Martin Klapez, Maurizio Casoni, Mirco Marchetti, Carlo Augusto Grazia

From Roofline to Ruggedness: Decomposing and Smoothing the GEMM Performance Landscape

Adjacent GEMM problems that differ by a single 128-element step in N can show 30% different throughput on the same GPU. This pervasive performance ruggedness - invisible to roofline analysis and peak-FLOPs intuition, yet dominant for every…

Performance · Computer Science 2026-05-29 Aditya Chatterjee

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to…

Performance · Computer Science 2026-05-29 Myeong Jun Jo

Range, Not Precision: Block-Floating-Point Half-Precision FFT and SAR Imaging on Apple Silicon

Half precision (FP16) promises to double FFT throughput on GPUs, but the prevailing view is that its 10-bit mantissa makes it unsuitable for radar-grade signal processing. We show this framing is wrong on Apple Silicon: the binding…

Performance · Computer Science 2026-05-28 Mohamed Amine Bergach

Attributing the System's Overall Effect to its Components

In a computer system, multiple indispensable components, such as the CPU and memory, work together to produce an overall effect, which can only be measured on an independently running system. Since the…

Performance · Computer Science 2026-05-27 Chenxi Wang, Lei Wang, Wanling Gao, Fanda Fan, Guoxin Kang, Hongxiao Li, Yuchen Su, Jianfeng Zhan

CARINA: Carbon-Aware Execution of Recurrent Industrial Analytics

Recurring industrial analytics and machine-learning workflows are becoming a major computational burden in modern engineering practice. Large parametric database generation, scheduled model retraining, repeated evaluation pipelines, and…

Performance · Computer Science 2026-05-26 Muhammad Umar Farooq

Throughput-Optimal Multiresource-Job Scheduling with Continuous Requirement Distribution

Modern computing systems process jobs with resource requirements such as CPU and memory, which are described by multiresource-job (MRJ) queueing models. In practice, job resource requirements are spread out over so many values that it is…

Performance · Computer Science 2026-05-22 Heyuan Yao, Willow Kowalik, Izzy Grosof

Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with thirteen Python-accessible JPEG decode paths on five matched…

Performance · Computer Science 2026-05-21 Vladimir Iglovikov, Dmitry Kosarevsky

Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training

We use discrete-event simulation to quantify the impact of fiber latency on the efficacy of geo-distributed AI model training with data parallelism. We conclude that the optimum distance between two AI clusters is 10-100 km, over which…

Performance · Computer Science 2026-05-20 Ioannis Papavasileiou, Sairam Prabhakar, Indu Kant Deo, Sergejs Makovejs

Reducing Waiting Time for Medical Tourists Through Hybrid Agent-Based and Discrete-Event Simulation: A Hospital Case Study

Medical tourists face a scheduling problem that differs from that of local patients. Treatment delays extend not just care delivery time, but also accommodation and travel costs. This study develops a hybrid agent-based and discrete-event…

Performance · Computer Science 2026-05-20 Melika Baghi, Hadi Mosadegh

Scalable Packed Layouts for Vector-Length-Agnostic ML Code Generation

Scalable vector instruction sets such as Arm SVE enable vector-length-agnostic (VLA) execution, allowing a single implementation to adapt across hardware with different vector lengths. However, they complicate compiler code generation, as…

Performance · Computer Science 2026-05-19 Ege Beysel, Maximilian Bartel, Jan Moritz Joseph

Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage

This work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be…

Performance · Computer Science 2026-05-18 Júlia Orteu Aubach, Fabio Banchelli, Marc Clascà Ramírez, Marta Garcia-Gasulla

SPLIT: SymPathy for Large jobs Improves Tail latency

We study the asymptotic response time tail in the M/G/n multi-server queue with heavy-tailed (regularly varying) job sizes, a setting representative of modern computing workloads. For single-server systems, tail optimization is well…

Performance · Computer Science 2026-05-14 Zhouzi Li, Mor Harchol-Balter, Alan Scheller-Wolf

Privacy-Preserving Aggregation of Controllable Loads to Compensate Fluctuations in Solar Power

Cybersecurity and privacy are of the utmost importance for safe, reliable operation of the electric grid. It is well known that the increased connectivity/interoperability between all stakeholders (e.g., utilities, suppliers, and consumers)…

Systems and Control · Computer Science 2026-05-14 Jin Dong, Teja Kuruganti, Seddik Djouadi, Mohammed Olama, Yaosuo Xue

A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

State-vector quantum circuit simulation is memory-bandwidth bound, yet the interaction between memory hierarchy, access pattern, and hardware parallelism remains incompletely characterized. We address this using the Apple M4 Pro Unified…

Performance · Computer Science 2026-05-13 Gyan Pratipat

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

KV-cache quantization is framed as a quality-latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + int4…

Performance · Computer Science 2026-05-08 Mohamed Amine Bergach

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

Open-weight large language models (LLMs) are usually named as model artifacts, but production users often consume them as hosted API services. This paper argues that the operational unit is a service object: a provider-specific,…

Performance · Computer Science 2026-05-08 Haorui Li, Zhenghui He, Xuanzi Liu, Yang Xu, Dongsheng Liu, Jiakang Ma, Lupan Wu, Yangjie Wu, Xiongchao Tang, Tianhui Shi

KEET: Explaining Performance of GPU Kernels Using LLM Agents

Performance profiles of GPU kernels generated by tools such as Nsight Compute are rich in detail but are often challenging to interpret. To achieve the best performance possible on a given GPU architecture, kernel developers need to spend…

Performance · Computer Science 2026-05-07 Joshua H. Davis, Klaudiusz Rydzy, Srinivasan Ramesh, Aadit Nilay, Daniel Nichols, Swapna Raj, Nikhil Jain, Abhinav Bhatele

DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt…

Performance · Computer Science 2026-05-07 Kathiravan Palaniappan

SPEC CPU: The Next Generation

The march toward developing relevant and robust CPU benchmarks continues with the introduction of SPEC CPU 2026, the next generation suite for measuring processor performance. This paper details the methodology behind its creation,…

Performance · Computer Science 2026-05-05 Mahesh Madhav, Allen Lee, Andres Mejia, Branden Moore, Charan Soppadandi, Chris Cambly, Christoph Müllner, Daniel Bowers, David Reiner, Denis Bakhvalov, Di Zhao, Duane Voth, Feng Xue, Frédérique Silber-Chaussumier, James Bucek, James Southern, Jiangning Liu, Jim Himer, John Henning, Kevin Smith, Kristen Yang, Kunal Kashyap, Mason Guy, Mat Colgrove, Michael Berg, Prasad Battini, Prasad Joshi, Rohit Prasad, Shayantika Bhattacharya, Sriyash Caculo, Stefan Reimbold, Sundar Iyengar, Van Smith, Zarko Todorovski