Related papers: Algorithm-Based Fault Tolerance for Parallel Stenc…

200 papers

With the increasing deployment of deep neural networks (DNNs) in terrestrial and aerospace safety-critical applications, system reliability has emerged as a co-equal design metric alongside computational efficiency. Algorithm-based fault…

Cryptography and Security · Computer Science · 2025-04-22 · Xinghua Xue, Cheng Liu, Feng Min, Tao Luo, Yinhe Han

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-01-26 · Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Zizhong Chen

Neural networks (NNs) are increasingly employed in safety-critical domains and in environments prone to unreliability (e.g., soft errors), such as on spacecraft. Therefore, it is critical to impart fault tolerance to NN inference.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-12-08 · Jack Kosaian, K. V. Rashmi

Hardware reliability is adversely affected by the downscaling of semiconductor devices and the scale-out of systems necessitated by modern applications. Apart from crashes, this unreliability often manifests as silent data corruptions…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-30 · Vassilis Vassiliadis, Konstantinos Parasyris, Christos D. Antonopoulos, Spyros Lalis, Nikolaos Bellas

Massive MIMO systems have the potential to significantly enhance spectral efficiency, yet their widespread integration is hindered by the high power consumption of the underlying computations. This paper explores the applicability and…

Signal Processing · Electrical Eng. & Systems · 2025-09-09 · Mikael Rinkinen, Mehdi Safarpour, Shahriar Shahabuddin, Olli Silven, Lauri Koskinen

Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently…

Machine Learning · Computer Science · 2025-07-23 · Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science · 2011-06-22 · Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical…

Machine Learning · Computer Science · 2026-02-10 · Yiheng Gao, Qin Hua, Zizhong Chen

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science · 2008-06-20 · George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

High-performance and safety-critical system architects must accurately evaluate the application-level silent data corruption (SDC) rates of processors to soft errors. Such an evaluation requires error propagation all the way from particle…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-05-05 · Siva Kumar Sastry Hari, Paolo Rech, Timothy Tsai, Mark Stephenson, Arslan Zulfiqar, Michael Sullivan, Philip Shirvani, Paul Racunas, Joel Emer, Stephen W. Keckler

Future extreme-scale computer systems may expose silent data corruption (SDC) to applications, in order to save energy or increase performance. However, resilience research struggles to come up with useful abstract programming models for…

Mathematical Software · Computer Science · 2014-01-15 · James Elliott, Mark Hoemmen, Frank Mueller

Field Programmable Gate Arrays (FPGAs) are more prone to be affected by transient faults in presence of radiation and other environmental hazards compared to Application Specific Integrated Circuits (ASICs). Hence, error mitigation and…

Hardware Architecture · Computer Science · 2015-09-24 · Swagata Mandal, Rourab Paul, Suman Sau, Amlan Chakrabarti, Subhasis Chattopadhyay

Transformer models rely on High-Performance Computing (HPC) resources for inference, where soft errors are inevitable in large-scale systems, making the reliability of the model particularly critical. Existing fault tolerance frameworks for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-08-14 · Huangliang Dai, Shixun Wu, Jiajun Huang, Zizhe Jian, Yue Zhu, Haiyang Hu, Zizhong Chen

We present FPDetect, a low overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-08-06 · Arnab Das, Sriram Krishnamoorthy, Ian Briggs, Ganesh Gopalakrishnan, Ramakrishna Tipireddy

GPU-based fast Fourier transform (FFT) is extremely important for scientific computing and signal processing. However, we find the inefficiency of existing FFT libraries and the absence of fault tolerance against soft error. To address…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-12-10 · Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Huangliang Dai, Sheng Di, Franck Cappello, Zizhong Chen

Graph convolutional networks (GCNs) are popular for building machine-learning applications for graph-structured data. This widespread adoption led to the development of specialized GCN hardware accelerators. In this work, we address a key…

Hardware Architecture · Computer Science · 2024-12-25 · Christodoulos Peltekis, Giorgos Dimitrakopoulos

As supercomputers grow in hardware complexity, their susceptibility to faults increases and measures need to be taken to ensure the correctness of results. Some numerical algorithms have certain characteristics that allow them to recover…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-16 · Thomas Saupe, Sebastian Götschel, Thibaut Lunet, Daniel Ruprecht, Robert Speck

Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes,…

Moving scientific computation from high-performance computing (HPC) and cloud computing (CC) environments to devices on the edge, i.e., physically near instruments of interest, has received tremendous interest in recent years. Such edge…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-11 · Christopher J. Vogl, Zachary Atkins, Alyson Fox, Agnieszka Miedlar, Colin Ponce

Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…

Systems and Control · Electrical Eng. & Systems · 2024-04-17 · Mohammadreza Amel Solouki, Shaahin Angizi, Massimo Violante