Related papers: Exploring GPU Stream-Aware Message Passing using T…

Exploring Fully Offloaded GPU Stream-Aware Message Passing

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-29 Naveen Namashivayam, Krishna Kandalla, James B White, Larry Kaplan, Mark Pagel

GPU-centric Communication Schemes for HPC and ML Applications

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-01 Naveen Namashivayam

Understanding GPU Triggering APIs for MPI+X Communication

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-01 Patrick G. Bridges, Anthony Skjellum, Evan D. Suggs, Derek Schafer, Purushotham V. Bangalore

Characterizing the Performance of Node-Aware Strategies for Irregular Point-to-Point Communication on Heterogeneous Architectures

Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-14 Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson

The Landscape of GPU-Centric Communication

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-22 Muhammet Abdullah Soyturk, Palwisha Akhtar, Erhan Tezcan, Didem Unat

Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-07 Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, Torsten Hoefler

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep…

Hardware Architecture · Computer Science 2019-08-26 Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan Tallent, Kevin Barker

GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-27 Baodi Shan, Mauricio Araya-Polo, Barbara Chapman

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

The Graphics Card as a Streaming Computer

Massive data sets have radically changed our understanding of how to design efficient algorithms; the streaming paradigm, whether in terms of the number of passes of an external memory algorithm, or the single pass and limited memory of a…

Graphics · Computer Science 2007-05-23 Suresh Venkatasubramanian

Improving Scalability with GPU-Aware Asynchronous Tasks

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

Scalable Construction of Spiking Neural Networks using up to thousands of GPUs

Diverse scientific and engineering research areas deal with discrete, time-stamped changes in large systems of interacting delay differential equations. Simulating such complex systems at scale on high-performance computing clusters demands…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-18 Bruno Golosio, Gianmarco Tiddia, José Villamar, Luca Pontisso, Luca Sergi, Francesco Simula, Pooja Babu, Elena Pastorelli, Abigail Morrison, Markus Diesmann, Alessandro Lonardo, Pier Stanislao Paolucci, Johanna Senk

Hiding Information in a Stream Control Transmission Protocol

The SCTP (Stream Control Transmission Protocol) is a candidate for a new transport layer protocol that may replace the TCP (Transmission Control Protocol) and the UDP (User Datagram Protocol) protocols in future IP networks. Currently, the…

Cryptography and Security · Computer Science 2011-04-19 Wojciech Fraczek, Wojciech Mazurczyk, Krzysztof Szczypiorski

Technical Report: Accelerating Dynamic Graph Analytics on GPUs

As graph analytics often involves compute-intensive operations, GPUs have been extensively used to accelerate the processing. However, in many applications such as social networks, cyber security, and fraud detection, their representative…

Data Structures and Algorithms · Computer Science 2018-06-28 Mo Sha, Yuchen Li, Bingsheng He, Kian-Lee Tan

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

GPU peer-to-peer techniques applied to a cluster interconnect

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific…

Computational Physics · Physics 2013-08-01 Roberto Ammendola, Massimo Bernaschi, Andrea Biagioni, Mauro Bisson, Massimiliano Fatica, Ottorino Frezza, Francesca Lo Cicero, Alessandro Lonardo, Enrico Mastrostefano, Pier Stanislao Paolucci, Davide Rossetti, Francesco Simula, Laura Tosoratto, Piero Vicini

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing 4 GPUs and dozens to hundreds of CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Michael Adams, Amanda Bienz

Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics

In the quest for highest performance in scientific computing, we present a novel framework that relies on high-bandwidth communication between GPUs in a compute cluster. The framework offers linear scaling of performance for explicit…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-16 Martin Rose, Simon Homes, Lukas Ramsperger, Jose Gracia, Christoph Niethammer, Jadran Vrabec