Related papers: Algorithmic Based Fault Tolerance Applied to High …
Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…
Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…
This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J., 54 (CTAC2012), pp. C394-C411]. The approach is novel for two reasons, first it…
Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the…
Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost…
With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a co-equal design metric alongside computational efficiency. Algorithm-based fault…
With the increasing complexity of computing systems, complete hardware reliability can no longer be guaranteed. We need, however, to ensure overall system reliability. One of the most important features of artificial neural networks is…
Fault tolerance is a major concern in distributed computational settings. In the classic master-worker setting, a server (the master) needs to perform some heavy computation which it may distribute to $m$ other machines (workers) in order…
GPU-based fast Fourier transform (FFT) is extremely important for scientific computing and signal processing. However, we find the inefficiency of existing FFT libraries and the absence of fault tolerance against soft error. To address…
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides…
In this paper, we examine the different measures of Fault Tolerance in a Distributed Simulated Annealing process. Optimization by Simulated Annealing on a distributed system is prone to various sources of failure. We analyse simulated…
Application partitioning and code offloading are being researched extensively during the past few years. Several frameworks for code offloading have been proposed. However, fewer works attempted to address issues occurred with its…
The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve…
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…
Fault tolerance in multi-core architecture has attracted attention of research community for the past 20 years. Rapid improvements in the CMOS technology resulted in exponential growth of transistor density. It resulted in increased…
Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered…
The traditional approach to fault tolerant computing involves replicating computation units and applying a majority vote operation on individual result bits. This approach, however, has several limitations; the most severe is the resource…
Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…
The increase in HPC systems size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, render modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are…
Fault tolerant quantum computing methods which work with efficient quantum error correcting codes are discussed. Several new techniques are introduced to restrict accumulation of errors before or during the recovery. Classes of eligible…