Related papers: UCX Programming Interface for Remote Function Inje…

cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications

Message Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally employ network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Xi Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, Dong Li

Bring the BitCODE -- Moving Compute and Data in Distributed Heterogeneous Systems

In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-13 Wenbin Lu, Luis E. Peña, Pavel Shamis, Valentin Churavy, Barbara Chapman, Steve Poole

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-01 Robert Gerstenberger, Maciej Besta, Torsten Hoefler

ucTrace: A Multi-Layer Profiling Tool for UCX-driven Communication

UCX is a communication framework that enables low-latency, high-bandwidth communication in HPC systems. With its unified API, UCX facilitates efficient data transfers across multi-node CPU-GPU clusters. UCX is widely used as the transport…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Emir Gencer, Mohammad Kefah Taha Issa, Ilyas Turimbetov, James D. Trotter, Didem Unat

Extending the Message Passing Interface (MPI) with User-Level Schedules

Composability is one of seven reasons for the long-standing and continuing success of MPI. Extending MPI by composing its operations with user-level operations provides useful integration with the progress engine and completion notification…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-27 Derek Schafer, Sheikh Ghafoor, Daniel Holmes, Martin Ruefenacht, Anthony Skjellum

Two-Chains: High Performance Framework for Function Injection and Execution

Some important problems, such as semantic graph analysis, require large-scale irregular applications composed of many coordinating tasks that operate on a shared data set so big it has to be stored on many physical devices. In these cases,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-01 Megan Grodowitz, Luis E. Peña, Curtis Dunham, Dong Zhong, Pavel Shamis, Steve Poole

Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs

Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, Ahmad Afsahi

Building Blocks for Network-Accelerated Distributed File Systems

High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data…

Networking and Internet Architecture · Computer Science 2022-06-22 Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, Milos Malesevic, Maciej Besta, Timo Schneider, Severin Kistler, Torsten Hoefler

Towards In-transit Analysis on Supercomputing Environments

The drive towards exascale computing is opening an enormous opportunity for more realistic and precise simulations of natural phenomena. The process of simulation, however, involves not only the numerical computation of predictions but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-21 Allan Santos, Hermano Lustosa, Fabio Porto, Bruno Schulze

Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware

Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-07 Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, Torsten Hoefler

RDMA vs. RPC for Implementing Distributed Data Structures

Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA)…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-16 Benjamin Brock, Yuxin Chen, Jiakun Yan, John D. Owens, Aydın Buluç, Katherine Yelick

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-18 Paul Cardosi, Bérenger Bramas

Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX)

Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters,…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-07 Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser

Lightweight Syntactic API Usage Analysis with UCov

Designing an effective API is essential for library developers as it is the lens through which clients will judge its usability and benefits, as well as the main friction point when the library evolves. Despite its importance, defining the…

Software Engineering · Computer Science 2024-02-20 Gustave Monce, Thomas Couturou, Yasmine Hamdaoui, Thomas Degueule, Jean-Rémy Falleri

MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues,…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 N. T. Karonis, B. Toonen, I. Foster

NetRPC: Enabling In-Network Computation in Remote Procedure Calls

People have shown that in-network computation (INC) significantly boosts performance in many application scenarios include distributed training, MapReduce, agreement, and network monitoring. However, existing INC programming is unfriendly…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-19 Bohan Zhao, Wenfei Wu, Wei Xu

The PetscSF Scalable Communication Layer

PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific Computation (PETSc), is designed to provide PETSc's communication infrastructure suitable for exascale computers that utilize GPUs and other…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-25 Junchao Zhang, Jed Brown, Satish Balay, Jacob Faibussowitsch, Matthew Knepley, Oana Marin, Richard Tran Mills, Todd Munson, Barry F. Smith, Stefano Zampini

High-performance symbolic-numerics via multiple dispatch

As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge…

Computation and Language · Computer Science 2022-02-08 Shashi Gowda, Yingbo Ma, Alessandro Cheli, Maja Gwozdz, Viral B. Shah, Alan Edelman, Christopher Rackauckas

An asynchronous and task-based implementation of Peridynamics utilizing HPX -- the C++ standard library for parallelism and concurrency

On modern supercomputers, asynchronous many task systems are emerging to address the new architecture of computational nodes. Through this shift of increasing cores per node, a new programming model with the focus on handle the fine-grain…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-02 Patrick Diehl, Prashant K. Jha, Hartmut Kaiser, Robert Lipton, Martin Levesque

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng