Related papers: Accelerating Communication for Parallel Programmin…

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs

Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, Ahmad Afsahi

Improving Scalability with GPU-Aware Asynchronous Tasks

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

Towards an Adaptive Runtime System for Cloud-Native HPC

The ongoing convergence of HPC and cloud computing presents a fundamental challenge: HPC applications, designed for static and homogeneous supercomputers, are ill-suited for the dynamic, heterogeneous, and volatile nature of the cloud.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Aditya Bhosale, Advait Tahilyani, Laxmikant Kale, Sara Kokkila-Schumacher

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-25 Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda

The Landscape of GPU-Centric Communication

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation

Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or rely on APIs that…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-06 Patrick G. Bridges, Derek Schafer, Jack Lange, James B. White, Anthony Skjellum, Evan Suggs, Thomas Hines, Purushotham Bangalore, Matthew G. F. Dosanjh, Whit Schonbein

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms

Overdecomposition has emerged as a powerful and sometimes essential technique in parallel programming. Many application domains or frameworks, including those based on adaptive mesh refinements, or tree codes use it. Charm++ is a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Aditya Bhosale, Anant Jain, Shourya Goel, Ritvik Rao, Peddoju Sateesh Kumar, Laxmikant Kale

GPU-centric Communication Schemes for HPC and ML Applications

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-01 Naveen Namashivayam

ucTrace: A Multi-Layer Profiling Tool for UCX-driven Communication

UCX is a communication framework that enables low-latency, high-bandwidth communication in HPC systems. With its unified API, UCX facilitates efficient data transfers across multi-node CPU-GPU clusters. UCX is widely used as the transport…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Emir Gencer, Mohammad Kefah Taha Issa, Ilyas Turimbetov, James D. Trotter, Didem Unat

Multi-GPU Quantum Circuit Simulation and the Impact of Network Performance

As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira

CGX: Adaptive System Support for Communication-Efficient Deep Learning

The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-02 Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh

Understanding GPU Triggering APIs for MPI+X Communication

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-01 Patrick G. Bridges, Anthony Skjellum, Evan D. Suggs, Derek Schafer, Purushotham V. Bangalore

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-12 Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Srikanth Sundaresan, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng

Dynamic Load Balancing in GPU-Based Systems - Early Experiments

The dynamic load-balancing framework in Charm++/AMPI, developed at the University of Illinois, is based on using processor virtualization to allow thread migration across processors. This framework has been successfully applied to many…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-17 Alvaro Luiz Fazenda, Celso L. Mendes, Laxmikant V. Kale, Jairo Panetta, Eduardo Rocha Rodrigues

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-09 Polykarpos Thomadakis, Nikos Chrisochoides

Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA

This paper presents a comprehensive comparison of three dominant parallel programming models in High Performance Computing (HPC): Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Compute Unified Device Architecture…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Nizar ALHafez, Ahmad Kurdi

Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics

As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that leverages a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-12 Dragana Grbic