Related papers: Exploiting Universal Redundancy
Serial-parallel redundancy is a reliable way to ensure that services and systems remain available in cloud computing. The method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…
Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…
Communication-avoiding algorithms allow redundant computations to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault-tolerance purposes. We illustrate this idea with QR…
Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our…
This paper addresses the problem of adding redundancy to a collection of physical objects so that the overall system is more robust to failures. In contrast to its information counterpart, which can exploit parity to protect multiple…
Fault tolerance is a key factor in the design of industrial computing systems. In practical terms, however, these systems, like every commercial product, are under tight financial constraints, and they have to remain in an operational state as long as…
With the surge in cloud storage adoption, enterprises face challenges managing data duplication and exponential data growth. Deduplication mitigates redundancy, yet maintaining redundancy ensures high availability, incurring storage costs.…
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel…
Fault-tolerance techniques depend on replication to enhance availability, albeit at the price of higher infrastructure costs. This results in a fundamental trade-off: Fault-tolerant services must satisfy given availability and performance…
In this paper, we define a new measure of the redundancy of information from a fault tolerance perspective. The partial information decomposition (PID) emerged in the last decade as a framework for decomposing the multi-source mutual information…
With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a co-equal design metric alongside computational efficiency. Algorithm-based fault…
The applications that are deployed in the cloud to provide services to the users encompass a large number of interconnected dependent cloud components. Multiple identical components are scheduled to run concurrently in order to handle…
Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the…
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource, one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…
With the increasing complexity of computing systems, complete hardware reliability can no longer be guaranteed. We need, however, to ensure overall system reliability. One of the most important features of artificial neural networks is…
Triple Modular Redundancy (TMR) has traditionally been used to ensure complete tolerance to a single fault or a faulty processing unit, where the processing unit may be a circuit or a system. However, TMR incurs more than 200% overhead in…
This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…
Storage is one of the most important parts of cloud computing, and Redundant Array of Independent Disks (RAID) systems are well-known and frequently used storage devices. With the increasing production of data in cloud environments,…
Joint safety and security analysis of cyber-physical systems is a necessary step to correctly capture inter-dependencies between these properties. Attack-Fault Trees represent a combination of dynamic Fault Trees and Attack Trees and can be…