Related papers: Fault Tolerance in Distributed Neural Computing
The functionality of electronic circuits can be seriously impaired by the occurrence of dynamic hardware faults. Particularly, for digital ultra low-power systems, a reduced safety margin can increase the probability of dynamic failures.…
Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent…
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…
Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…
The robustness of distributed optimization is an emerging field of study, motivated by various applications of distributed optimization including distributed machine learning, distributed sensing, and swarm robotics. With the rapid…
Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties making a remote faulty process…
With the rapid advancements of deep learning in the past decade, it can be foreseen that deep learning will be continuously deployed in more and more safety-critical applications such as autonomous driving and robotics. In this context,…
It has been an open question in deep learning if fault-tolerant computation is possible: can arbitrarily reliable computation be achieved using only unreliable neurons? In the grid cells of the mammalian cortex, analog error correction…
Embedded distributed inference of Neural Networks has emerged as a promising approach for deploying machine-learning models on resource-constrained devices in an efficient and scalable manner. The inference task is distributed across a…
Fault-tolerant deep learning accelerator is the basis for highly reliable deep learning processing and critical to deploy deep learning in safety-critical applications such as avionics and robotics. Since deep learning is known to be…
Fault-tolerant distributed algorithms are central for building reliable spatially distributed systems. Unfortunately, the lack of a canonical precise framework for fault-tolerant algorithms is an obstacle for both verification and…
Collaborative inference has received significant research interest in machine learning as a vehicle for distributing computation load, reducing latency, as well as addressing privacy preservation in communications. Recent collaborative…
Fault-tolerant distributed systems offer high reliability because even if faults in their components occur, they do not exhibit erroneous behavior. Depending on the fault model adopted, hardware and software errors that do not result in a…
The loss of a few neurons in a brain rarely results in any visible loss of function. However, the insight into what "few" means in this context is unclear. How many random neuron failures will it take to lead to a visible loss of function?…
The potential for neuromorphic computing to provide intrinsic fault tolerance has long been speculated, but the brain's robustness in neuromorphic applications has yet to be demonstrated. Here, we show that a previously described, natively…
Deep neural networks (DNNs) are state-of-the-art algorithms for multiple applications, spanning from image classification to speech recognition. While providing excellent accuracy, they often have enormous compute and memory requirements.…
This paper addresses the challenges of fault prediction and delayed response in distributed systems by proposing an intelligent prediction method based on temporal feature learning. The method takes multi-dimensional performance metric…
Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability,…
Neural Networks are currently one of the most widely deployed machine learning algorithms. In particular, Convolutional Neural Networks (CNNs), are gaining popularity and are evaluated for deployment in safety critical applications such as…
For many types of integrated circuits, accepting larger failure rates in computations can be used to improve energy efficiency. We study the performance of faulty implementations of certain deep neural networks based on pessimistic and…