Related papers: ReStore: In-Memory REplicated STORagE for Rapid Re…

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi, Sathish Vadhiyar

PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi, Sathish Vadhiyar

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna

Checkpoint-Restart Libraries Must Become More Fault Tolerant

Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with byzantine outcomes. But, certain popular libraries that…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-22 Anthony Skjellum, Derek Schafer

Invalidation-Based Protocols for Replicated Datastores

Distributed in-memory datastores underpin cloud applications that run within a datacenter and demand high performance, strong consistency, and availability. A key feature of datastores is data replication. The data are replicated across…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-07 Antonios Katsarakis

ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment

Applications in science and engineering often require huge computational resources for solving problems within a reasonable time frame. Parallel supercomputers provide the computational infrastructure for solving such problems. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Rajesh Sudarsan, Calvin J. Ribbens

Efficient Multidimensional Data Redistribution for Resizable Parallel Computations

Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-06-15 Rajesh Sudarsan, Calvin J. Ribbens

DXRAM's Fault-Tolerance Mechanisms Meet High Speed I/O Devices

In-memory key-value stores provide consistent low-latency access to all objects which is important for interactive large-scale applications like social media networks or online graph analytics and also opens up new application areas. But,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-17 Kevin Beineke, Stefan Nothaas, Michael Schoettner

An Improved Multiple Faults Reassignment based Recovery in Cluster Computing

In case of multiple node failures, performance becomes very low as compared to a single node failure. Failures of nodes in cluster computing can be tolerated by multiple fault tolerant computing. Existing recovery schemes are efficient for…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-02-15 Sanjay Bansal, Sanjeev Sharma

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

Characterizing Synchronous Writes in Stable Memory Devices

Distributed algorithms that operate in the fail-recovery model rely on the state stored in stable memory to guarantee the irreversibility of operations even in the presence of failures. The performance of these algorithms leans heavily on…

Operating Systems · Computer Science 2020-02-19 William B. Mingardi, Gustavo M. D. Vieira

Reliable Data Storage in Distributed Hash Tables

Distributed Hash Tables offer a resilient lookup service for unstable distributed environments. Resilient data storage, however, requires additional data replication and maintenance algorithms. These algorithms can have an impact on both…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Matthew Leslie

Checkpoint/restart approaches for a thread-based MPI runtime

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-13 Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Perache, Patrick Carribault, Julien Jaeger, Allen D. Malony, Sameer Shende

Fast Product-Matrix Regenerating Codes

Distributed storage systems support failures of individual devices by the use of replication or erasure correcting codes. While erasure correcting codes offer a better storage efficiency than replication for similar fault tolerance, they…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-12-10 Nicolas Le Scouarnec

How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Almond Kiruthu Murimi

Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

A Fault Tolerant Mechanism for Partitioning and Offloading Framework in Pervasive Environments

Application partitioning and code offloading are being researched extensively during the past few years. Several frameworks for code offloading have been proposed. However, fewer works attempted to address issues occurred with its…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-21 Nevin Vunka Jungum, Nawaz Mohamudally, Nimal Nissanke

Recollection: an Alternative Restoration Technique for Constraint Programming Systems

Search is a key service within constraint programming systems, and it demands the restoration of previously accessed states during the exploration of a search tree. Restoration proceeds either bottom-up within the tree to roll back…

Programming Languages · Computer Science 2016-02-05 Yong Lin, Martin Henz

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…

Programming Languages · Computer Science 2023-11-15 Germán Vidal

Reliable Replication Protocols on SmartNICs

Today's datacenter applications rely on datastores that are required to provide high availability, consistency, and performance. To achieve high availability, these datastores replicate data across several nodes. Such replication is managed…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-25 M. R. Siavash Katebzadeh, Antonios Katsarakis, Boris Grot