Related papers: Algorithmic Based Fault Tolerance Applied to High …

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Michael Treaster

Scalable and Fault Tolerant Computation with the Sparse Grid Combination Technique

This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J., 54 (CTAC2012), pp. C394-C411]. The approach is novel for two reasons, first it…

Numerical Analysis · Mathematics 2014-04-11 Brendan Harding, Markus Hegland, Jay Larson, James Southern

On the Performance and Convergence of Distributed Stream Processing via Approximate Fault Tolerance

Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-13 Zhinan Cheng, Qun Huang, Patrick P. C. Lee

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-03-04 Blesson Varghese, Gerard McKee, Vassil Alexandrov

ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Neural Network Processing

With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a co-equal design metric alongside computational efficiency. Algorithm-based fault…

Cryptography and Security · Computer Science 2025-04-22 Xinghua Xue, Cheng Liu, Feng Min, Tao Luo, Yinhe Han

Fault Tolerance in Distributed Neural Computing

With the increasing complexity of computing systems, complete hardware reliability can no longer be guaranteed. We need, however, to ensure overall system reliability. One of the most important features of artificial neural networks is…

Neural and Evolutionary Computing · Computer Science 2015-10-07 Anton Kulakov, Mark Zwolinski, Jeff Reeve

Near-Optimal Fault Tolerance for Efficient Batch Matrix Multiplication via an Additive Combinatorics Lens

Fault tolerance is a major concern in distributed computational settings. In the classic master-worker setting, a server (the master) needs to perform some heavy computation which it may distribute to $m$ other machines (workers) in order…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-30 Keren Censor-Hillel, Yuka Machino, Pedro Soto

TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs

GPU-based fast Fourier transform (FFT) is extremely important for scientific computing and signal processing. However, we find the inefficiency of existing FFT libraries and the absence of fault tolerance against soft error. To address…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-10 Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Franck Cappello, Zizhong Chen

FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-09 Yujia Zhai, Elisabeth Giem, Quan Fan, Kai Zhao, Jinyang Liu, Zizhong Chen

Measures of Fault Tolerance in Distributed Simulated Annealing

In this paper, we examine the different measures of Fault Tolerance in a Distributed Simulated Annealing process. Optimization by Simulated Annealing on a distributed system is prone to various sources of failure. We analyse simulated…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-01-01 Aaditya Prakash

A Fault Tolerant Mechanism for Partitioning and Offloading Framework in Pervasive Environments

Application partitioning and code offloading are being researched extensively during the past few years. Several frameworks for code offloading have been proposed. However, fewer works attempted to address issues occurred with its…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-21 Nevin Vunka Jungum, Nawaz Mohamudally, Nimal Nissanke

Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?

The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-08-14 Blesson Varghese, Gerard McKee, Vassil Alexandrov

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi, Ian Foster

A Survey of fault mitigation techniques for multi-core architectures

Fault tolerance in multi-core architecture has attracted attention of research community for the past 20 years. Rapid improvements in the CMOS technology resulted in exponential growth of transistor density. It resulted in increased…

Hardware Architecture · Computer Science 2022-01-03 Shashikiran Venkatesha, Ranjani Parthasarathi

Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-12 Morgan Geldenhuys, Lauritz Thamsen, Odej Kao

Computer Arithmetic Preserving Hamming Distance of Operands in Operation Result

The traditional approach to fault tolerant computing involves replicating computation units and applying a majority vote operation on individual result bits. This approach, however, has several limitations; the most severe is the resource…

Hardware Architecture · Computer Science 2011-04-19 Shlomi Dolev, Sergey Frenkel, Dan Tamir

High-level Stream Processing: A Complementary Analysis of Fault Recovery

Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-14 Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser

Algorithm-Based Fault Tolerance for Parallel Stencil Computations

The increase in HPC systems size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, render modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Aurélien Cavelan, Florina M. Ciorba

Efficient fault-tolerant quantum computing

Fault tolerant quantum computing methods which work with efficient quantum error correcting codes are discussed. Several new techniques are introduced to restrict accumulation of errors before or during the recovery. Classes of eligible…

Quantum Physics · Physics 2009-10-31 Andrew M. Steane