性能
Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt…
The march toward developing relevant and robust CPU benchmarks continues with the introduction of SPEC CPU 2026, the next generation suite for measuring processor performance. This paper details the methodology behind its creation,…
Virtually all practical settings where preemptive scheduling is employed are susceptible to preemption overhead, and accounting for these overheads is necessary to make informed scheduling design decisions. However, preemption overhead is…
The operational landscape of local Large Language Model (LLM) inference has shifted from lightweight models to datacenter-class weights exceeding 70B parameters, creating profound systems challenges for consumer hardware. This paper…
For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace driver. As a result, the translation from…
The rapid expansion of Transformer-based large language models has dramatically increased the need for high-performance GPUs. As a result, there is growing demand for fast, accurate, and widely generalizable GPU performance models to…
Summary: denet is a lightweight process monitoring tool providing real-time resource profiling of running processes. It reports CPU, memory, disk I/O, network activity, and thread usage, including recursive child monitoring, with adaptive…
Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks, sparking an innovation wave that has led to new applications and services, and significantly enhanced solution schemes. Despite all…
Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but…
HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify minimal changes for a near-miss configuration to meet a…
As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware…
Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference…
In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the toolcalls create pauses between LLM requests and cause KV-Cache eviction, forcing…
Modern NVIDIA GPUs like the H100 (HBM2e) and H200 (HBM3e) share similar compute characteristics but differ significantly in memory interface technology and bandwidth. By isolating memory bandwidth as a key variable, the power distribution…
The rapid adoption of Large Language Models (LLMs) has made GPU inference efficiency an increasingly critical system concern. The runtime of LLM workloads is largely dominated by tile-based kernels, particularly General Matrix…
Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, while the system preserves long-term context and generates…
Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores $K$ out of $mn$ two-dimensional DCT coefficients per weight matrix and reconstructs the full matrix through an inverse DCT at every…
On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of…
We present memshare\footnote{The Software package is published as a CRAN package under https://CRAN.R-project.org/package=memshare, a package that enables shared memory multicore computation in R by allocating buffers in C++ shared memory…
An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same…