Related papers: Exploiting Universal Redundancy

Serial Parallel Reliability Redundancy Allocation Optimization for Energy Efficient and Fault Tolerant Cloud Computing

Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-08 Gutha Jaya Krishna

Dependability in Embedded Systems: A Survey of Fault Tolerance Methods and Software-Based Mitigation Techniques

Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…

Systems and Control · Electrical Eng. & Systems 2024-04-17 Mohammadreza Amel Solouki, Shaahin Angizi, Massimo Violante

Exploiting Redundant Computation in Communication-Avoiding Algorithms for Algorithm-Based Fault Tolerance

Communication-avoiding algorithms allow redundant computations to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault-tolerance purpose. We illustrate this idea with QR…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-03 Camille Coti

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Michael Treaster

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-22 Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

Defect tolerance: fundamental limits and examples

This paper addresses the problem of adding redundancy to a collection of physical objects so that the overall system is more robust to failures. In contrast to its information counterpart, which can exploit parity to protect multiple…

Information Theory · Computer Science 2017-11-09 Jennifer Tang, Da Wang, Yury Polyanskiy, Gregory Wornell

Industrial Computing Systems: A Case Study of Fault Tolerance Analysis

Fault tolerance is a key factor of industrial computing systems design. But in practical terms, these systems, like every commercial product, are under great financial constraints and they have to remain in operational state as long as…

Systems and Control · Computer Science 2015-03-31 Andrey A. Shchurov

FASTEN: Towards a FAult-tolerant and STorage EfficieNt Cloud: Balancing Between Replication and Deduplication

With the surge in cloud storage adoption, enterprises face challenges managing data duplication and exponential data growth. Deduplication mitigates redundancy, yet maintaining redundancy ensures high availability, incurring storage costs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-10 Sabbir Ahmed, Md Nahiduzzaman, Tariqul Islam, Faisal Haque Bappy, Tarannum Shaila Zaman, Raiful Hasan

Algorithmic Based Fault Tolerance Applied to High Performance Computing

We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-06-20 George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

PCRAFT: Capacity Planning for Dependable Stateless Services

Fault-tolerance techniques depend on replication to enhance availability, albeit at the cost of increased infrastructure costs. This results in a fundamental trade-off: Fault-tolerant services must satisfy given availability and performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-16 Rasha Faqeh, Andrè Martin, Valerio Schiavoni, Pramod Bhatotia, Pascal Felber, Christof Fetzer

Measuring the Redundancy of Information from a Source Failure Perspective

In this paper, we define a new measure of the redundancy of information from a fault tolerance perspective. The partial information decomposition (PID) emerged last decade as a framework for decomposing the multi-source mutual information…

Information Theory · Computer Science 2024-04-03 Jesse Milzman

ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Neural Network Processing

With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a co-equal design metric alongside computational efficiency. Algorithm-based fault…

Cryptography and Security · Computer Science 2025-04-22 Xinghua Xue, Cheng Liu, Feng Min, Tao Luo, Yinhe Han

RRFT: A Rank-Based Resource Aware Fault Tolerant Strategy for Cloud Platforms

The applications that are deployed in the cloud to provide services to the users encompass a large number of interconnected dependent cloud components. Multiple identical components are scheduled to run concurrently in order to handle…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-12 Chinmaya Kumar Dehury, Prasan Kumar Sahoo, Bharadwaj Veeravalli

On the Performance and Convergence of Distributed Stream Processing via Approximate Fault Tolerance

Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-13 Zhinan Cheng, Qun Huang, Patrick P. C. Lee

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource, one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi, Ian Foster

Fault Tolerance in Distributed Neural Computing

With the increasing complexity of computing systems, complete hardware reliability can no longer be guaranteed. We need, however, to ensure overall system reliability. One of the most important features of artificial neural networks is…

Neural and Evolutionary Computing · Computer Science 2015-10-07 Anton Kulakov, Mark Zwolinski, Jeff Reeve

Fault-Tolerant Design Approach Based on Approximate Computing

Triple Modular Redundancy (TMR) has been traditionally used to ensure complete tolerance to a single fault or a faulty processing unit, where the processing unit may be a circuit or a system. However, TMR incurs more than 200% overhead in…

Hardware Architecture · Computer Science 2023-11-02 P Balasubramanian, D L Maskell

How Machine Learning-Data Driven Replication Strategies Enhance Fault Tolerance in Large-Scale Distributed Systems

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Almond Kiruthu Murimi

Investigating the Reliability in Three RAID Storage Models and Effect of Ordering Replicas on Disks

One of the most important parts of cloud computing is storage devices, and Redundant Array of Independent Disks (RAID) systems are well known and frequently used storage devices. With the increasing production of data in cloud environments,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-06 Leila Namvari-Tazehkand, Saeid Pashazadeh

Model-Based Generation of Attack-Fault Trees

Joint safety and security analysis of cyber-physical systems is a necessary step to correctly capture inter-dependencies between these properties. Attack-Fault Trees represent a combination of dynamic Fault Trees and Attack Trees and can be…

Cryptography and Security · Computer Science 2023-09-19 Raffaela Groner, Thomas Witte, Alexander Raschke, Sophie Hirn, Irdin Pekaric, Markus Frick, Matthias Tichy, Michael Felderer