Related papers: Fault Tolerance for Remote Memory Access Programmi…

200 papers

Remote Direct Memory Access (RDMA) is a technology that allows direct memory access from the memory of one computer into that of another without involving either one's operating system. This enables high-throughput, low-latency networking,…

Logic in Computer Science · Computer Science · 2026-05-12 · Parosh Aziz Abdulla, Mohamed Faouzi Atig, Govind Rajanbabu, Stephan Spengler

Remote memory access (RMA) is an emerging high-performance programming model that uses RDMA hardware directly. Yet, accessing remote memories cannot invoke activities at the target, which complicates implementation and limits performance of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-10-20 · Maciej Besta, Torsten Hoefler

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-02-26 · Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Igor Zablotchi

In-memory key-value stores provide consistent low-latency access to all objects which is important for interactive large-scale applications like social media networks or online graph analytics and also opens up new application areas. But,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-07-17 · Kevin Beineke, Stefan Nothaas, Michael Schoettner

Resistive random-access memory (RRAM) is gaining popularity due to its ability to offer computing within the memory and its non-volatile nature. The unique properties of RRAM, such as binary switching, multi-state switching, and device…

Emerging Technologies · Computer Science · 2024-07-08 · Simranjeet Singh, Farhad Merchant, Sachin Patkar

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science · 2011-06-22 · Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science · 2007-05-23 · Adriana Iamnitchi, Ian Foster

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-10-26 · Sarthak Joshi, Sathish Vadhiyar

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-01 · Robert Gerstenberger, Maciej Besta, Torsten Hoefler

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is…

Distributed, Parallel, and Cluster Computing · Computer Science · 2015-05-19 · Faisal Shahzad, Moritz Kreutzer, Thomas Zeiser, Rui Machado, Andreas Pieper, Georg Hager, Gerhard Wellein

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-15 · Sarthak Joshi, Sathish Vadhiyar

Reliability of complex Cyber-Physical Systems is necessary to guarantee availability and/or safety of the provided services. Diverse and complex fault tolerance policies are adopted to enhance reliability, that include a varied mix of…

Software Engineering · Computer Science · 2022-08-26 · Alessandro Fantechi, Gloria Gori, Marco Papini

Remote Memory Access (RMA), also known as single sided communications, provides a way of accessing the memory of other processes without having to issue explicit message passing style communication calls. Previous studies have concluded…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-10-27 · Nick Brown, Michael Bareford, Michèle Weiland

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science · 2008-06-20 · George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

Memory spatial errors, i.e., buffer overflow vulnerabilities, have been a well-known issue in computer security for a long time and remain one of the root causes of exploitable vulnerabilities. Most of the existing mitigation tools adopt a…

Cryptography and Security · Computer Science · 2020-04-07 · Dongwei Chen, Daliang Xu, Dong Tong, Kang Sun, Xuetao Guan, Chun Yang, Xu Cheng

This paper summarizes our work on characterizing application memory error vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory (HRM), which was published in DSN 2014, and examines the work's significance and future…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-05-11 · Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, Onur Mutlu

Application partitioning and code offloading are being researched extensively during the past few years. Several frameworks for code offloading have been proposed. However, fewer works attempted to address issues occurred with its…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-09-21 · Nevin Vunka Jungum, Nawaz Mohamudally, Nimal Nissanke

As memory technologies continue to shrink and memory error rates increase, the demand for stronger reliability becomes increasingly critical. Fine-grain memory replication has emerged as an appealing approach to improving memory fault…

Hardware Architecture · Computer Science · 2025-02-25 · Haris Volos, Yiannakis Sazeides

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensure computing integrity and enable efficient application-level fault tolerance mechanisms. However, we…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-02-16 · Luanzheng Guo, Dong Li

Fault tolerance is a critical aspect of modern computing systems, ensuring correct functionality in the presence of faults. This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in…

Systems and Control · Electrical Eng. & Systems · 2024-04-17 · Mohammadreza Amel Solouki, Shaahin Angizi, Massimo Violante