Related papers: stdchk: A Checkpoint Storage System for Desktop Gr…

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-13 Kai Keller, Leonardo Bautista Gomez

CheckSync: Using Runtime-Integrated Checkpoints to Achieve High Availability

CheckSync provides applications with high availability via runtime-integrated checkpointing. This allows CheckSync to take checkpoints of a process running in a memory-managed language (Go, for now), which can be resumed on another machine…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-08 Nicolaas Kaashoek, Robert Morris

DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching

Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-18 Zaoxing Liu, Zhihao Bai, Zhenming Liu, Xiaozhou Li, Changhoon Kim, Vladimir Braverman, Xin Jin, Ion Stoica

A Comparative Study of Replication Techniques in Grid Computing Systems

Grid Computing is a type of parallel and distributed systems that is designed to provide reliable access to data and computational resources in wide area networks. These resources are distributed in different geographical locations, however…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-09-27 Sheida Dayyani, Mohammad Reza Khayyambashi

Extending the OpenCHK Model with Advanced Checkpoint Features

One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-02 Marcos Maroñas, Sergi Mateo, Kai Keller, Leonardo Bautista-Gomez, Eduard Ayguadé, Vicenç Beltran

Technical Report: Efficient Buffering and Scheduling for a Single-Chip Crosspoint-Queued Switch

The single-chip crosspoint-queued (CQ) switch is a compact switching architecture that has all its buffers placed at the crosspoints of input and output lines. Scheduling is also performed inside the switching core, and does not rely on…

Networking and Internet Architecture · Computer Science 2014-03-11 Zizhong Cao, Shivendra S. Panwar

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-08 Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

Recomputation Enabled Efficient Checkpointing

Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Amortizing…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-30 Ismail Akturk, Ulya R. Karpuzcu

Cost-aware Joint Caching and Forwarding in Networks with Heterogeneous Cache Resources

Caching is crucial for enabling high-throughput networks for data intensive applications. Traditional caching technology relies on DRAM, as it can transfer data at a high rate. However, DRAM capacity is subject to contention by most system…

Networking and Internet Architecture · Computer Science 2023-10-12 Faruk Volkan Mutlu, Edmund Yeh

Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances

Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-07 Ashley Tung, Haiyan Wang, Yue Li, Zhong Wang, Jingchao Sun

Improving Performance of Iterative Methods by Lossy Checkpointing

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-30 Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

A Survey on User-Space Storage and Its Implementations

The storage stack in the traditional operating system is primarily optimized towards improving the CPU utilization and hiding the long I/O latency imposed by the slow I/O devices such as hard disk drives (HDDs). However, the emerging…

Operating Systems · Computer Science 2023-06-21 Junzhe Li, Xiurui Pan, Shushu Yi, Jie Zhang

CCNCheck: Enabling Checkpointed Distributed Applications in Content Centric Networks

We consider the problem of checkpointing a distributed application efficiently in Content Centric Networks so that it can withstand transient failures. We present CCNCheck, a system which enables a sender optimized way of checkpointing…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Nitinder Mohan, Pushpendra Singh

Optimal Multi-Level Interval-based Checkpointing for Exascale Stream Processing Systems

State-of-the-art stream processing platforms make use of checkpointing to support fault tolerance, where a "checkpoint tuple" flows through the topology to all operators, indicating a checkpoint and triggering a checkpoint operation. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-17 Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

Lightweight Fault Tolerance in Large-Scale Distributed Graph Processing

The success of Google's Pregel framework in distributed graph processing has inspired a surging interest in developing Pregel-like platforms featuring a user-friendly "think like a vertex" programming model. Existing Pregel-like systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-26 Da Yan, James Cheng, Fan Yang

A Survey on Tiering and Caching in High-Performance Storage Systems

Although every individual invented storage technology made a big step towards perfection, none of them is spotless. Different data store essentials such as performance, availability, and recovery requirements have not met together in a…

Hardware Architecture · Computer Science 2019-04-29 Morteza Hoseinzadeh

Optimizing SSD Caches for Cloud Block Storage Systems Using Machine Learning Approaches

The growing demand for efficient cloud storage solutions has led to the widespread adoption of Solid-State Drives (SSDs) for caching in cloud block storage systems. The management of data writes to SSD caches plays a crucial role in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-30 Chiyu Cheng, Chang Zhou, Yang Zhao, Jin Cao

Run-time application migration using checkpoint/restore in userspace

This paper presents an empirical study on the feasibility of using Checkpoint/Restore In Userspace (CRIU) for run-time application migration between hosts, with a particular focus on edge computing and cloud infrastructures. The paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-25 Aleksandar Tošić

Extending DIRAC File Management with Erasure-Coding for efficient storage

The state of the art in Grid style data management is to achieve increased resilience of data via multiple complete replicas of data files across multiple storage endpoints. While this is effective, it is not the most space-efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-20 Samuel Cadellin Skipsey, Paulin Todev, David Britton, David Crooks, Gareth Roy