Related papers: Implementing Efficient Message Logging Protocols a…

Implicit Actions and Non-blocking Failure Recovery with MPI

Scientific applications have long embraced the MPI as the environment of choice to execute on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-27 Aurelien Bouteiller, George Bosilca

Extending the Message Passing Interface (MPI) with User-Level Schedules

Composability is one of seven reasons for the long-standing and continuing success of MPI. Extending MPI by composing its operations with user-level operations provides useful integration with the progress engine and completion notification…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-27 Derek Schafer, Sheikh Ghafoor, Daniel Holmes, Martin Ruefenacht, Anthony Skjellum

PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi, Sathish Vadhiyar

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi, Sathish Vadhiyar

A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, which more or less always results in program termination in case of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-16 Christian Engwer, Mirco Altenbernd, Nils-Arne Dreier, Dominik Göddeke

Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications

Due to the increasing size of HPC machines, the fault presence is becoming an eventuality that applications must face. Natively, MPI provides no support for the execution past the detection of a fault, and this is becoming more and more…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-22 Roberto Rocco, Davide Gadioli, Gianluca Palermo

Modeling the Potential of Message-Free Communication via CXL.mem

Heterogeneous memory technologies are increasingly important instruments in addressing the memory wall in HPC systems. While most are deployed in single node setups, CXL.mem is a technology that implements memories that can be attached to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-10 Stepan Vanecek, Matthew Turner, Manisha Gajbe, Matthew Wolf, Martin Schulz

User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation

As hardware failures such as node losses become increasingly common, MPI programmers may want to save vulnerable data in a resilient store. While third-party storage solutions such as Redis or the Hazelcast IMap exist, a tailored, MPI-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-21 Claudia Fohry, Rainer Fink

Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

MPI Advance: Open-Source Message Passing Optimizations

The large variety of production implementations of the message passing interface (MPI) each provide unique and varying underlying algorithms. Each emerging supercomputer supports one or a small number of system MPI installations, tuned for…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-15 Amanda Bienz, Derek Schafer, Anthony Skjellum

PGMPI: Automatically Verifying Self-Consistent MPI Performance Guidelines

The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-05 Sascha Hunold, Alexandra Carpen-Amarie, Felix Donatus Lübbe, Jesper Larsson Träff

MATCH: An MPI Fault Tolerance Benchmark Suite

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, Ignacio Laguna, Dong Li

MPI Progress For All

The progression of communication in the Message Passing Interface (MPI) is not well defined, yet it is critical for application performance, particularly in achieving effective computation and communication overlap. The opaque nature of MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-16 Hui Zhou, Robert Latham, Ken Raffenetti, Yanfei Guo, Rajeev Thakur

Fault-Aware Non-Collective Communication Creation and Reparation in MPI

The increasing size of HPC architectures makes the faults' presence a more and more frequent eventuality. This issue becomes especially relevant since MPI, the de-facto standard for inter-process communication, lacks proper fault management…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-25 Roberto Rocco, Gianluca Palermo

MPI-over-CXL: Enhancing Communication Efficiency in Distributed HPC Systems

MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-17 Miryeong Kwon, Donghyun Gouk, Hyein Woo, Junhee Kim, Jinwoo Baek, Kyungkuk Nam, Sangyoon Ji, Jiseon Kim, Hanyeoreum Bae, Junhyeok Jang, Hyunwoo You, Junseok Moon, Myoungsoo Jung

Tuning MPI Collectives by Verifying Performance Guidelines

MPI collective operations provide a standardized interface for performing data movements within a group of processes. The efficiency of collective communication operations depends on the actual algorithm, its implementation, and the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-11 Sascha Hunold, Alexandra Carpen-Amarie

cMPI: Using CXL Memory Sharing for MPI One-Sided and Two-Sided Inter-Node Communications

Message Passing Interface (MPI) is a foundational programming model for high-performance computing. MPI libraries traditionally employ network interconnects (e.g., Ethernet and InfiniBand) and network protocols (e.g., TCP and RoCE) with…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Xi Wang, Bin Ma, Jongryool Kim, Byungil Koh, Hoshik Kim, Dong Li

Building a fault tolerant application using the GASPI communication layer

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-19 Faisal Shahzad, Moritz Kreutzer, Thomas Zeiser, Rui Machado, Andreas Pieper, Georg Hager, Gerhard Wellein

Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism

Irregular communication often limits both the performance and scalability of parallel applications. Typically, applications individually implement irregular messages using point-to-point communications, and any optimizations are added…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-06 Gerald Collom, Rui Peng Li, Amanda Bienz

MPI Benchmarking Revisited: Experimental Design and Reproducibility

The Message Passing Interface (MPI) is the prevalent programming model used on today's supercomputers. Therefore, MPI library developers are looking for the best possible performance (shortest run-time) of individual MPI functions across…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-30 Sascha Hunold, Alexandra Carpen-Amarie