Related papers: Fault Tolerance for Remote Memory Access Programmi…

On the Verification Problem of Remote Direct Memory Access programs (Extended Version with Appendix)

Remote Direct Memory Access (RDMA) is a technology that allows direct memory access from the memory of one computer into that of another without involving either one's operating system. This enables high-throughput, low-latency networking,…

Logic in Computer Science · Computer Science 2026-05-12 Parosh Aziz Abdulla , Mohamed Faouzi Atig , Govind Rajanbabu , Stephan Spengler

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations

Remote memory access (RMA) is an emerging high-performance programming model that uses RDMA hardware directly. Yet, accessing remote memories cannot invoke activities at the target which complicates implementation and limits performance of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-20 Maciej Besta , Torsten Hoefler

The Impact of RDMA on Agreement

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-26 Marcos K. Aguilera , Naama Ben-David , Rachid Guerraoui , Virendra Marathe , Igor Zablotchi

DXRAM's Fault-Tolerance Mechanisms Meet High Speed I/O Devices

In-memory key-value stores provide consistent low-latency access to all objects which is important for interactive large-scale applications like social media networks or online graph analytics and also opens up new application areas. But,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-17 Kevin Beineke , Stefan Nothaas , Michael Schoettner

Resistive Memory for Computing and Security: Algorithms, Architectures, and Platforms

Resistive random-access memory (RRAM) is gaining popularity due to its ability to offer computing within the memory and its non-volatile nature. The unique properties of RRAM, such as binary switching, multi-state switching, and device…

Emerging Technologies · Computer Science 2024-07-08 Simranjeet Singh , Farhad Merchant , Sachin Patkar

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao , Mingyu Chen , Rui Wang , Wenli Zhang , Guangming Tan

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi , Ian Foster

PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi , Sathish Vadhiyar

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-01 Robert Gerstenberger , Maciej Besta , Torsten Hoefler

Building a fault tolerant application using the GASPI communication layer

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-19 Faisal Shahzad , Moritz Kreutzer , Thomas Zeiser , Rui Machado , Andreas Pieper , Georg Hager , Gerhard Wellein

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi , Sathish Vadhiyar

Runtime reliability monitoring for complex fault-tolerance policies

Reliability of complex Cyber-Physical Systems is necessary to guarantee availability and/or safety of the provided services. Diverse and complex fault tolerance policies are adopted to enhance reliability, that include a varied mix of…

Software Engineering · Computer Science 2022-08-26 Alessandro Fantechi , Gloria Gori , Marco Papini

Leveraging MPI RMA to optimise halo-swapping communications in MONC on Cray machines

Remote Memory Access (RMA), also known as single sided communications, provides a way of accessing the memory of other processes without having to issue explicit message passing style communication calls. Previous studies have concluded…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-27 Nick Brown , Michael Bareford , Michèle Weiland

Algorithmic Based Fault Tolerance Applied to High Performance Computing

We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-06-20 George Bosilca , Remi Delmas , Jack Dongarra , Julien Langou

Saturation Memory Access: Mitigating Memory Spatial Errors without Terminating Programs

Memory spatial errors, i.e., buffer overflow vulnerabilities, have been a well-known issue in computer security for a long time and remain one of the root causes of exploitable vulnerabilities. Most of the existing mitigation tools adopt a…

Cryptography and Security · Computer Science 2020-04-07 Dongwei Chen , Daliang Xu , Dong Tong , Kang Sun , Xuetao Guan , Chun Yang , Xu Cheng

Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

This paper summarizes our work on characterizing application memory error vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory (HRM), which was published in DSN 2014, and examines the work's significance and future…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-11 Yixin Luo , Sriram Govindan , Bikash Sharma , Mark Santaniello , Justin Meza , Aman Kansal , Jie Liu , Badriddine Khessib , Kushagra Vaid , Onur Mutlu

A Fault Tolerant Mechanism for Partitioning and Offloading Framework in Pervasive Environments

Application partitioning and code offloading are being researched extensively during the past few years. Several frameworks for code offloading have been proposed. However, fewer works attempted to address issues occurred with its…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-21 Nevin Vunka Jungum , Nawaz Mohamudally , Nimal Nissanke

Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication

As memory technologies continue to shrink and memory error rates increase, the demand for stronger reliability becomes increasingly critical. Fine-grain memory replication has emerged as an appealing approach to improving memory fault…

Hardware Architecture · Computer Science 2025-02-25 Haris Volos , Yiannakis Sazeides

MOARD: Modeling Application Resilience to Transient Faults on Data Objects

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Luanzheng Guo , Dong Li

Dependability in Embedded Systems: A Survey of Fault Tolerance Methods and Software-Based Mitigation Techniques

Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…

Systems and Control · Electrical Eng. & Systems 2024-04-17 Mohammadreza Amel Solouki , Shaahin Angizi , Massimo Violante