Related papers: Dissecting GPU Memory Hierarchy through Microbench…

200 papers

The rapid development in scientific research provides a need for more compute power, which is partly being solved by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture by studying GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-23 Aaron Jarmusch, Nathan Graddon, Sunita Chandrasekaran

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-19 Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

This study presents a comprehensive multi-level analysis of the NVIDIA Hopper GPU architecture, focusing on its performance characteristics and novel features. We benchmark Hopper's memory subsystem, highlighting improvements in the L2…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A…

Hardware Architecture · Computer Science 2024-02-22 Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu

Graphics Processing Units (GPUs) were once used solely for graphical computation tasks but with the increase in the use of machine learning applications, the use of GPUs to perform general-purpose computing has increased in the last few…

Hardware Architecture · Computer Science 2021-02-16 Asim Ikram, Muhammad Awais Ali, Mirza Omer Beg

This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics…

Hardware Architecture · Computer Science 2018-05-30 Onur Mutlu, Saugata Ghose, Rachata Ausavarungnirun

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in supercomputers to mobile phones and tablets. GPUs are used for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz

As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA Blackwell (B200)…

Hardware Architecture · Computer Science 2026-03-04 Aaron Jarmusch, Sunita Chandrasekaran

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific…

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby…

Logic in Computer Science · Computer Science 2025-05-27 Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, Omkar Tuppe

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Zhu Zhu, Yu Sun, Dhatri Parakal, Bo Fang, Steven Farrell, Gregory H. Bauer, Brett Bode, Ian T. Foster, Michael E. Papka, William Gropp, Zhao Zhang, Lishan Yang

GPUs are vastly underutilized, even when running resource-intensive AI applications, as GPU kernels within each job have diverse resource profiles that may saturate some parts of a device while often leaving other parts idle. Colocating…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Paul Elvinger, Foteini Strati, Natalie Enright Jerger, Ana Klimovic

GPUs exploit a high degree of thread-level parallelism to hide long-latency stalls. Due to the heterogeneous compute requirements of different applications, there is a growing need to share the GPU across multiple applications in…

We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-04 Guin Gilman, Robert J. Walls

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-10 Oded Green, James Fox, Jeffrey Young, Jun Shirako, David Bader

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Danyal Yorulmaz, Bulat Ibragimov, Pınar Tözün

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-04 Mehmet Deveci, Simon D. Hammond, Michael M. Wolf, Sivasankaran Rajamanickam

Video analytics pipelines have steadily shifted to edge deployments to reduce bandwidth overheads and privacy violations, but in doing so, face an ever-growing resource tension. Most notably, edge-box GPUs lack the memory needed to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-06 Arthi Padmanabhan, Neil Agarwal, Anand Iyer, Ganesh Ananthanarayanan, Yuanchao Shu, Nikolaos Karianakis, Guoqing Harry Xu, Ravi Netravali