Related papers: Fault Tolerance in Distributed Neural Computing

Towards Dynamic Fault Tolerance for Hardware-Implemented Artificial Neural Networks: A Deep Learning Approach

The functionality of electronic circuits can be seriously impaired by the occurrence of dynamic hardware faults. Particularly, for digital ultra low-power systems, a reduced safety margin can increase the probability of dynamic failures.…

Machine Learning · Computer Science 2022-10-18 Daniel Gregorek, Nils Hülsmeier, Steffen Paul

Distributed Learning over Unreliable Networks

Most of today's distributed machine learning systems assume reliable networks: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-17 Chen Yu, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan Alistarh, Ce Zhang, Ji Liu

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource, one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi, Ian Foster

Dependability in Embedded Systems: A Survey of Fault Tolerance Methods and Software-Based Mitigation Techniques

Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…

Systems and Control · Electrical Eng. & Systems 2024-04-17 Mohammadreza Amel Solouki, Shaahin Angizi, Massimo Violante

A Survey on Fault-tolerance in Distributed Optimization and Machine Learning

The robustness of distributed optimization is an emerging field of study, motivated by various applications of distributed optimization including distributed machine learning, distributed sensing, and swarm robotics. With the rapid…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-29 Shuo Liu

Self-healing Dilemmas in Distributed Systems: Fault Correction vs. Fault Tolerance

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties making a remote faulty process…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-25 Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras

Fault-Tolerant Deep Learning: A Hierarchical Perspective

With the rapid advancements of deep learning in the past decade, it can be foreseen that deep learning will be continuously deployed in more and more safety-critical applications such as autonomous driving and robotics. In this context,…

Hardware Architecture · Computer Science 2022-04-06 Cheng Liu, Zhen Gao, Siting Liu, Xuefei Ning, Huawei Li, Xiaowei Li

Fault-Tolerant Neural Networks from Biological Error Correction Codes

It has been an open question in deep learning if fault-tolerant computation is possible: can arbitrarily reliable computation be achieved using only unreliable neurons? In the grid cells of the mammalian cortex, analog error correction…

Machine Learning · Computer Science 2025-03-26 Alexander Zlokapa, Andrew K. Tan, John M. Martyn, Ila R. Fiete, Max Tegmark, Isaac L. Chuang

Embedded Distributed Inference of Deep Neural Networks: A Systematic Review

Embedded distributed inference of Neural Networks has emerged as a promising approach for deploying machine-learning models on resource-constrained devices in an efficient and scalable manner. The inference task is distributed across a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-07 Federico Nicolás Peccia, Oliver Bringmann

Cross-Layer Optimization for Fault-Tolerant Deep Learning

Fault-tolerant deep learning accelerator is the basis for highly reliable deep learning processing and critical to deploy deep learning in safety-critical applications such as avionics and robotics. Since deep learning is known to be…

Hardware Architecture · Computer Science 2023-12-22 Qing Zhang, Cheng Liu, Bo Liu, Haitong Huang, Ying Wang, Huawei Li, Xiaowei Li

Starting a Dialog between Model Checking and Fault-tolerant Distributed Algorithms

Fault-tolerant distributed algorithms are central for building reliable spatially distributed systems. Unfortunately, the lack of a canonical precise framework for fault-tolerant algorithms is an obstacle for both verification and…

Formal Languages and Automata Theory · Computer Science 2012-10-16 Annu John, Igor Konnov, Ulrich Schmid, Helmut Veith, Josef Widder

Fault-Tolerant Collaborative Inference through the Edge-PRUNE Framework

Collaborative inference has received significant research interest in machine learning as a vehicle for distributing computation load, reducing latency, as well as addressing privacy preservation in communications. Recent collaborative…

Machine Learning · Computer Science 2022-06-17 Jani Boutellier, Bo Tan, Jari Nurmi

Decentralized Validation for Non-malicious Arbitrary Fault Tolerance in Paxos

Fault-tolerant distributed systems offer high reliability because even if faults in their components occur, they do not exhibit erroneous behavior. Depending on the fault model adopted, hardware and software errors that do not result in a…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-19 Rodrigo R. Barbieri, Enrique S. dos Santos, Gustavo M. D. Vieira

The Probabilistic Fault Tolerance of Neural Networks in the Continuous Limit

The loss of a few neurons in a brain rarely results in any visible loss of function. However, the insight into what "few" means in this context is unclear. How many random neuron failures will it take to lead to a visible loss of function?…

Machine Learning · Statistics 2019-09-26 El-Mahdi El-Mhamdi, Rachid Guerraoui, Andrei Kucharavy, Sergei Volodin

Intrinsic Numerical Robustness and Fault Tolerance in a Neuromorphic Algorithm for Scientific Computing

The potential for neuromorphic computing to provide intrinsic fault tolerance has long been speculated, but the brain's robustness in neuromorphic applications has yet to be demonstrated. Here, we show that a previously described, natively…

Neural and Evolutionary Computing · Computer Science 2026-03-12 Bradley H. Theilman, James B. Aimone

FAT: Training Neural Networks for Reliable Inference Under Hardware Faults

Deep neural networks (DNNs) are state-of-the-art algorithms for multiple applications, spanning from image classification to speech recognition. While providing excellent accuracy, they often have enormous compute and memory requirements.…

Machine Learning · Computer Science 2020-11-12 Ussama Zahid, Giulio Gambardella, Nicholas J. Fraser, Michaela Blott, Kees Vissers

Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures

This paper addresses the challenges of fault prediction and delayed response in distributed systems by proposing an intelligent prediction method based on temporal feature learning. The method takes multi-dimensional performance metric…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-28 Yang Wang, Wenxuan Zhu, Xuehui Quan, Heyi Wang, Chang Liu, Qiyuan Wu

On Misbehaviour and Fault Tolerance in Machine Learning Systems

Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability,…

Software Engineering · Computer Science 2022-10-18 Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö, Jukka K. Nurminen, Tommi Mikkonen

Efficient Error-Tolerant Quantized Neural Network Accelerators

Neural Networks are currently one of the most widely deployed machine learning algorithms. In particular, Convolutional Neural Networks (CNNs), are gaining popularity and are evaluated for deployment in safety critical applications such as…

Signal Processing · Electrical Eng. & Systems 2019-12-17 Giulio Gambardella, Johannes Kappauf, Michaela Blott, Christoph Doehring, Martin Kumm, Peter Zipf, Kees Vissers

A Study of Deep Learning Robustness Against Computation Failures

For many types of integrated circuits, accepting larger failure rates in computations can be used to improve energy efficiency. We study the performance of faulty implementations of certain deep neural networks based on pessimistic and…

Neural and Evolutionary Computing · Computer Science 2017-04-19 Jean-Charles Vialatte, François Leduc-Primeau