Related papers: Towards Distributed Software Resilience in Asynchr…

Implementing Software Resiliency in HPX for Extreme Scale Computing

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Therefore, designing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-16 Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Algorithmic Based Fault Tolerance Applied to High Performance Computing

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-06-20 George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi, Sathish Vadhiyar

Application-Level Resilience Modeling for HPC Fault Tolerance

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently, we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-02 Luanzheng Guo, Hanlin He, Dong Li

Exploiting Universal Redundancy

Fault tolerance is essential for building reliable services; however, it comes at the price of redundancy, mainly the "replication factor" and "diversity". With the increasing reliance on Internet-based services, more machines (mainly…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Ali Shoker

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi, Ian Foster

Supporting OpenMP 5.0 Tasks in hpxMP -- A study of an OpenMP implementation within Task Based Runtime Systems

OpenMP has been the de facto standard for single node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-20 Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser

How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Almond Kiruthu Murimi

Resilience through Automated Adaptive Configuration for Distribution and Replication

This paper presents a powerful automated framework for making complex systems resilient under failures, by optimized adaptive distribution and replication of interdependent software components across heterogeneous hardware components with…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-13 Scott D. Stoller, Balaji Jayasankar, Yanhong A. Liu

RRFT: A Rank-Based Resource Aware Fault Tolerant Strategy for Cloud Platforms

The applications that are deployed in the cloud to provide services to the users encompass a large number of interconnected dependent cloud components. Multiple identical components are scheduled to run concurrently in order to handle…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-12 Chinmaya Kumar Dehury, Prasan Kumar Sahoo, Bharadwaj Veeravalli

PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi, Sathish Vadhiyar

An Introduction to hpxMP: A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

Asynchronous Many-task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-09 Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, Hartmut Kaiser

An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud

Scientific workflows have been predominantly used for complex and large scale data analysis and scientific computation/automation and the need for robust workflow scheduling techniques has grown considerably. But, most of the existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-04 S. Jaya Nirmala, Amrith Rajagopal Setlur, Har Simrat Singh, Sudhanshu Khoriya

Experiences Porting Distributed Applications to Asynchronous Tasks: A Multidimensional FFT Case-study

Parallel algorithms relying on synchronous parallelization libraries often experience adverse performance due to global synchronization barriers. Asynchronous many-task runtimes offer task futurization capabilities that minimize or remove…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-05 Alexander Strack, Christopher Taylor, Patrick Diehl, Dirk Pflüger

Efficient Task Replication for Fast Response Times in Parallel Computation

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

WRATH: Workload Resilience Across Task Hierarchies in Task-based Parallel Programming Frameworks

Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-31 Sicheng Zhou, Zhuozhao Li, Valérie Hayot-Sasson, Haochen Pan, Maxime Gonthier, J. Gregory Pauloski, Ryan Chard, Kyle Chard, Ian Foster

A Fault Tolerance Mechanism for Hybrid Scientific Workflows

In large distributed systems, failures are a daily event occurring frequently, especially with growing numbers of computation tasks and locations on which they are deployed. The advantage of representing an application with a workflow is…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Alberto Mulone, Doriana Medić, Marco Aldinucci

Advanced Synchronization Techniques for Task-based Runtime Systems

Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-18 David Álvarez, Kevin Sala, Marcos Maroñas, Aleix Roca, Vicenç Beltran

Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud Computing

Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-08 Gutha Jaya Krishna