Related papers: Exploring GPU Stream-Aware Message Passing using T…

200 papers

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-29 Naveen Namashivayam, Krishna Kandalla, James B White, Larry Kaplan, Mark Pagel

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-01 Naveen Namashivayam

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-01 Patrick G. Bridges, Anthony Skjellum, Evan D. Suggs, Derek Schafer, Purushotham V. Bangalore

Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-14 Shelby Lockhart, Amanda Bienz, William D. Gropp, Luke N. Olson

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-22 Muhammet Abdullah Soyturk, Palwisha Akhtar, Erhan Tezcan, Didem Unat

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-07 Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, Torsten Hoefler

High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computational capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep…

Hardware Architecture · Computer Science 2019-08-26 Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan Tallent, Kevin Barker

Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-27 Baodi Shan, Mauricio Araya-Polo, Barbara Chapman

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

Massive data sets have radically changed our understanding of how to design efficient algorithms; the streaming paradigm, whether in terms of the number of passes of an external memory algorithm, or the single pass and limited memory of a…

Graphics · Computer Science 2007-05-23 Suresh Venkatasubramanian

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

Diverse scientific and engineering research areas deal with discrete, time-stamped changes in large systems of interacting delay differential equations. Simulating such complex systems at scale on high-performance computing clusters demands…

SCTP (Stream Control Transmission Protocol) is a candidate for a new transport layer protocol that may replace TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) in future IP networks. Currently, the…

Cryptography and Security · Computer Science 2011-04-19 Wojciech Fraczek, Wojciech Mazurczyk, Krzysztof Szczypiorski

As graph analytics often involves compute-intensive operations, GPUs have been extensively used to accelerate the processing. However, in many applications such as social networks, cyber security, and fraud detection, their representative…

Data Structures and Algorithms · Computer Science 2018-06-28 Mo Sha, Yuchen Li, Bingsheng He, Kian-Lee Tan

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Modern GPUs support special protocols to exchange data directly across the PCI Express bus. While these protocols could be used to reduce GPU data transmission times, basically by avoiding staging to host memory, they require specific…

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing $4$ GPUs and dozens to hundreds of CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Michael Adams, Amanda Bienz

In the quest for highest performance in scientific computing, we present a novel framework that relies on high-bandwidth communication between GPUs in a compute cluster. The framework offers linear scaling of performance for explicit…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-16 Martin Rose, Simon Homes, Lukas Ramsperger, Jose Gracia, Christoph Niethammer, Jadran Vrabec