Related papers: ApproxABFT: Approximate Algorithm-Based Fault Tole…

Arithmetic-Intensity-Guided Fault Tolerance for Neural Network Inference on GPUs

Neural networks (NNs) are increasingly employed in safety-critical domains and in environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is critical to impart fault tolerance to NN inference.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-08 Jack Kosaian, K. V. Rashmi

V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning

Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical…

Machine Learning · Computer Science 2026-02-10 Yiheng Gao, Qin Hua, Zizhong Chen

Near-Threshold Voltage Massive MIMO Computing

Massive MIMO systems have the potential to significantly enhance spectral efficiency, yet their widespread integration is hindered by the high power consumption of the underlying computations. This paper explores the applicability and…

Signal Processing · Electrical Eng. & Systems 2025-09-09 Mikael Rinkinen, Mehdi Safarpour, Shahriar Shahabuddin, Olli Silven, Lauri Koskinen

FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-26 Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Zizhong Chen

Algorithm-Based Fault Tolerance for Parallel Stencil Computations

The increase in HPC systems size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, render modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Aurélien Cavelan, Florina M. Ciorba

Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers

Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently…

Machine Learning · Computer Science 2025-07-23 Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos

TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs

GPU-based fast Fourier transform (FFT) is extremely important for scientific computing and signal processing. However, we find the inefficiency of existing FFT libraries and the absence of fault tolerance against soft error. To address…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-10 Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Franck Cappello, Zizhong Chen

ReaLM: Reliable and Efficient Large Language Model Inference with Statistical Algorithm-Based Fault Tolerance

The demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. As accelerators are vulnerable to hardware faults due to aging, variation, etc., existing accelerator designs often…

Hardware Architecture · Computer Science 2025-04-08 Tong Xie, Jiawang Zhao, Zishen Wan, Zuodong Zhang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li

Exposing Reliability Degradation and Mitigation in Approximate DNNs under Permanent Faults

Approximate computing is known for enhancing deep neural network accelerators' energy efficiency by introducing inexactness with a tolerable accuracy loss. However, small accuracy variations may increase the sensitivity of these…

Hardware Architecture · Computer Science 2023-02-23 Ayesha Siddique, Khaza Anuarul Hoque

FAT: Training Neural Networks for Reliable Inference Under Hardware Faults

Deep neural networks (DNNs) are state-of-the-art algorithms for multiple applications, spanning from image classification to speech recognition. While providing excellent accuracy, they often have enormous compute and memory requirements.…

Machine Learning · Computer Science 2020-11-12 Ussama Zahid, Giulio Gambardella, Nicholas J. Fraser, Michaela Blott, Kees Vissers

Enhancing Fault Tolerance of Neural Networks for Security-Critical Applications

Neural Networks (NN) have recently emerged as backbone of several sensitive applications like automobile, medical image, security, etc. NNs inherently offer Partial Fault Tolerance (PFT) in their architecture; however, the biased PFT of NNs…

Machine Learning · Computer Science 2019-02-14 Manaar Alam, Arnab Bag, Debapriya Basu Roy, Dirmanto Jap, Jakub Breier, Shivam Bhasin, Debdeep Mukhopadhyay

Intelligent Proactive Fault Tolerance at the Edge through Resource Usage Prediction

The proliferation of demanding applications and edge computing establishes the need for an efficient management of the underlying computing infrastructures, urging the providers to rethink their operational methods. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-13 Theodoros Theodoropoulos, John Violos, Stylianos Tsanakas, Aris Leivadeas, Konstantinos Tserpes, Theodora Varvarigou

Adaptive Soft Error Protection for Neural Network Processing

Previous research on selective protection for neural network components typically exploits only static vulnerability differences. Although these methods improve upon classical modular redundancy, they still incur substantial overhead for…

Machine Learning · Computer Science 2026-04-24 Xinghua Xue, Cheng Liu, Feng Min, Yinhe Han

FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention

Transformer models rely on High-Performance Computing (HPC) resources for inference, where soft errors are inevitable in large-scale systems, making the reliability of the model particularly critical. Existing fault tolerance frameworks for…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-14 Huangliang Dai, Shixun Wu, Jiajun Huang, Zizhe Jian, Yue Zhu, Haiyang Hu, Zizhong Chen

Exploring Fault-Energy Trade-offs in Approximate DNN Hardware Accelerators

Systolic array-based deep neural network (DNN) accelerators have recently gained prominence for their low computational cost. However, their high energy consumption poses a bottleneck to their deployment in energy-constrained devices. To…

Machine Learning · Computer Science 2021-01-11 Ayesha Siddique, Kanad Basu, Khaza Anuarul Hoque

Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey

Deep Neural Networks (DNNs) are very popular because of their high performance in various cognitive tasks in Machine Learning (ML). Recent advancements in DNNs have brought beyond human accuracy in many tasks, but at the cost of high…

Hardware Architecture · Computer Science 2022-03-18 Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel

ApproxDBN: Approximate Computing for Discriminative Deep Belief Networks

Probabilistic generative neural networks are useful for many applications, such as image classification, speech recognition and occlusion removal. However, the power budget for hardware implementations of neural networks can be extremely…

Neural and Evolutionary Computing · Computer Science 2017-05-09 Xiaojing Xu, Srinjoy Das, Ken Kreutz-Delgado

AX-DBN: An Approximate Computing Framework for the Design of Low-Power Discriminative Deep Belief Networks

The power budget for embedded hardware implementations of Deep Learning algorithms can be extremely tight. To address implementation challenges in such domains, new design paradigms, like Approximate Computing, have drawn significant…

Image and Video Processing · Electrical Eng. & Systems 2019-03-27 Ian Colbert, Ken Kreutz-Delgado, Srinjoy Das

EPSILON: Adaptive Fault Mitigation in Approximate Deep Neural Network using Statistical Signatures

The increasing adoption of approximate computing in deep neural network accelerators (AxDNNs) promises significant energy efficiency gains. However, permanent faults in AxDNNs can severely degrade their performance compared to their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-30 Khurram Khalil, Khaza Anuarul Hoque

CorrectNet: Robustness Enhancement of Analog In-Memory Computing for Neural Networks by Error Suppression and Compensation

The last decade has witnessed the breakthrough of deep neural networks (DNNs) in many fields. With the increasing depth of DNNs, hundreds of millions of multiply-and-accumulate (MAC) operations need to be executed. To accelerate such…

Hardware Architecture · Computer Science 2022-11-29 Amro Eldebiky, Grace Li Zhang, Georg Boecherer, Bing Li, Ulf Schlichtmann