Related papers: Accelerating Communication for Parallel Programmin…

200 papers

Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, Ahmad Afsahi

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

The ongoing convergence of HPC and cloud computing presents a fundamental challenge: HPC applications, designed for static and homogeneous supercomputers, are ill-suited for the dynamic, heterogeneous, and volatile nature of the cloud.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Aditya Bhosale, Advait Tahilyani, Laxmikant Kale, Sara Kokkila-Schumacher

Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-25 Aamir Shafi, Jahanzeb Maqbool Hashmi, Hari Subramoni, Dhabaleswar K. Panda

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

Removing the CPU from the communication fast path is essential to efficient GPU-based ML and HPC application performance. However, existing GPU communication APIs either continue to rely on the CPU for communication or rely on APIs that…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-06 Patrick G. Bridges, Derek Schafer, Jack Lange, James B. White, Anthony Skjellum, Evan Suggs, Thomas Hines, Purushotham Bangalore, Matthew G. F. Dosanjh, Whit Schonbein

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Overdecomposition has emerged as a powerful and sometimes essential technique in parallel programming. Many application domains or frameworks, including those based on adaptive mesh refinements, or tree codes use it. Charm++ is a parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-14 Aditya Bhosale, Anant Jain, Shourya Goel, Ritvik Rao, Peddoju Sateesh Kumar, Laxmikant Kale

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-01 Naveen Namashivayam

UCX is a communication framework that enables low-latency, high-bandwidth communication in HPC systems. With its unified API, UCX facilitates efficient data transfers across multi-node CPU-GPU clusters. UCX is widely used as the transport…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Emir Gencer, Mohammad Kefah Taha Issa, Ilyas Turimbetov, James D. Trotter, Didem Unat

As is intrinsic to the fundamental goal of quantum computing, classical simulation of quantum algorithms is notoriously demanding in resource requirements. Nonetheless, simulation is critical to the success of the field and a requirement…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 W. Michael Brown, Anurag Ramesh, Thomas Lubinski, Thien Nguyen, David E. Bernal Neira

The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-02 Ilia Markov, Hamidreza Ramezanikebrya, Dan Alistarh

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-01 Patrick G. Bridges, Anthony Skjellum, Evan D. Suggs, Derek Schafer, Purushotham V. Bangalore

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face…

The dynamic load-balancing framework in Charm++/AMPI, developed at the University of Illinois, is based on using processor virtualization to allow thread migration across processors. This framework has been successfully applied to many…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-17 Alvaro Luiz Fazenda, Celso L. Mendes, Laxmikant V. Kale, Jairo Panetta, Eduardo Rocha Rodrigues

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-09 Polykarpos Thomadakis, Nikos Chrisochoides

This paper presents a comprehensive comparison of three dominant parallel programming models in High Performance Computing (HPC): Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Compute Unified Device Architecture…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Nizar ALHafez, Ahmad Kurdi

As exascale systems reach unprecedented concurrency, traditional performance analysis tools struggle with the overhead of massive-scale telemetry. We present an accelerated infrastructure for the hpcanalysis framework that leverages a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-12 Dragana Grbic