Related papers: Dissecting GPU Memory Hierarchy through Microbench…

Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks

The rapid development in scientific research provides a need for more compute power, which is partly being solved by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture by studying GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-23 Aaron Jarmusch, Nathan Graddon, Sunita Chandrasekaran

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-19 Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis

This study presents a comprehensive multi-level analysis of the NVIDIA Hopper GPU architecture, focusing on its performance characteristics and novel features. We benchmark Hopper's memory subsystem, highlighting improvements in the L2…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A…

Hardware Architecture · Computer Science 2024-02-22 Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu

Cache Bypassing for Machine Learning Algorithms

Graphics Processing Units (GPUs) were once used solely for graphical computation tasks but with the increase in the use of machine learning applications, the use of GPUs to perform general-purpose computing has increased in the last few…

Hardware Architecture · Computer Science 2021-02-16 Asim Ikram, Muhammad Awais Ali, Mirza Omer Beg

Recent Advances in Overcoming Bottlenecks in Memory Systems and Managing Memory Resources in GPU Systems

This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics…

Hardware Architecture · Computer Science 2018-05-30 Onur Mutlu, Saugata Ghose, Rachata Ausavarungnirun

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in supercomputers to mobile phones and tablets. GPUs are used for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz

Microbenchmarking NVIDIA's Blackwell Architecture: An in-depth Architectural Analysis

As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA Blackwell (B200)…

Hardware Architecture · Computer Science 2026-03-04 Aaron Jarmusch, Sunita Chandrasekaran

GPU peer-to-peer techniques applied to a cluster interconnect

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific…

Computational Physics · Physics 2013-08-01 Roberto Ammendola, Massimo Bernaschi, Andrea Biagioni, Mauro Bisson, Massimiliano Fatica, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Enrico Mastrostefano, Pier Stanislao Paolucci, Davide Rossetti, Francesco Simula, Laura Tosoratto, Piero Vicini

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby…

Logic in Computer Science · Computer Science 2025-05-27 Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, Omkar Tuppe

Understanding the Landscape of Ampere GPU Memory Errors

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Zhu Zhu, Yu Sun, Dhatri Parakal, Bo Fang, Steven Farrell, Gregory H. Bauer, Brett Bode, Ian T. Foster, Michael E. Papka, William Gropp, Zhao Zhang, Lishan Yang

Understanding GPU Resource Interference One Level Deeper

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

Improving Multi-Application Concurrency Support Within the GPU Memory System

GPUs exploit a high degree of thread-level parallelism to hide long-latency stalls. Due to the heterogeneous compute requirements of different applications, there is a growing need to share the GPU across multiple applications in…

Hardware Architecture · Computer Science 2017-08-17 Rachata Ausavarungnirun, Christopher J. Rossbach, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Onur Mutlu

Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads

We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-04 Guin Gilman, Robert J. Walls

Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-10 Oded Green, James Fox, Jeffrey Young, Jun Shirako, David Bader

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-04 Mehmet Deveci, Simon D. Hammond, Michael M. Wolf, Sivasankaran Rajamanickam

GEMEL: Model Merging for Memory-Efficient, Real-Time Video Analytics at the Edge

Video analytics pipelines have steadily shifted to edge deployments to reduce bandwidth overheads and privacy violations, but in doing so, face an ever-growing resource tension. Most notably, edge-box GPUs lack the memory needed to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-06 Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan, Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, Ravi Netravali