Related papers: ReStore: In-Memory REplicated STORagE for Rapid Re…

200 papers

The number of faults in high-performance systems is expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi, Sathish Vadhiyar

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi, Sathish Vadhiyar

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-16 Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna

Production MPI codes need checkpoint-restart (CPR) support. Clearly, checkpoint-restart libraries must be fault tolerant lest they open up a window of vulnerability for failures with byzantine outcomes. But, certain popular libraries that…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-22 Anthony Skjellum, Derek Schafer

Distributed in-memory datastores underpin cloud applications that run within a datacenter and demand high performance, strong consistency, and availability. A key feature of datastores is data replication. The data are replicated across…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-07 Antonios Katsarakis

Applications in science and engineering often require huge computational resources for solving problems within a reasonable time frame. Parallel supercomputers provide the computational infrastructure for solving such problems. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Rajesh Sudarsan, Calvin J. Ribbens

Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-06-15 Rajesh Sudarsan, Calvin J. Ribbens

In-memory key-value stores provide consistent low-latency access to all objects which is important for interactive large-scale applications like social media networks or online graph analytics and also opens up new application areas. But,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-17 Kevin Beineke, Stefan Nothaas, Michael Schoettner

In the case of multiple node failures, performance becomes very low compared to a single node failure. Failures of nodes in cluster computing can be tolerated by multiple fault tolerant computing. Existing recovery schemes are efficient for…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-02-15 Sanjay Bansal, Sanjeev Sharma

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

Distributed algorithms that operate in the fail-recovery model rely on the state stored in stable memory to guarantee the irreversibility of operations even in the presence of failures. The performance of these algorithms leans heavily on…

Operating Systems · Computer Science 2020-02-19 William B. Mingardi, Gustavo M. D. Vieira

Distributed Hash Tables offer a resilient lookup service for unstable distributed environments. Resilient data storage, however, requires additional data replication and maintenance algorithms. These algorithms can have an impact on both…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Matthew Leslie

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-13 Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Perache, Patrick Carribault, Julien Jaeger, Allen D. Malony, Sameer Shende

Distributed storage systems support failures of individual devices by the use of replication or erasure correcting codes. While erasure correcting codes offer a better storage efficiency than replication for similar fault tolerance, they…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-12-10 Nicolas Le Scouarnec

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Almond Kiruthu Murimi

Efficient utilization of today's high-performance computing (HPC) systems with complex hardware and software components requires that the HPC applications are designed to tolerate process failures at runtime. With low mean time to failure…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

Application partitioning and code offloading have been researched extensively over the past few years. Several frameworks for code offloading have been proposed. However, fewer works have attempted to address issues that occur with its…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-21 Nevin Vunka Jungum, Nawaz Mohamudally, Nimal Nissanke

Search is a key service within constraint programming systems, and it demands the restoration of previously accessed states during the exploration of a search tree. Restoration proceeds either bottom-up within the tree to roll back…

Programming Languages · Computer Science 2016-02-05 Yong Lin, Martin Henz

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…

Programming Languages · Computer Science 2023-11-15 Germán Vidal

Today's datacenter applications rely on datastores that are required to provide high availability, consistency, and performance. To achieve high availability, these datastores replicate data across several nodes. Such replication is managed…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-25 M. R. Siavash Katebzadeh, Antonios Katsarakis, Boris Grot