Related papers: CheckSync: Using Runtime-Integrated Checkpoints to…

Run-time application migration using checkpoint/restore in userspace

This paper presents an empirical study on the feasibility of using Checkpoint/Restore In Userspace (CRIU) for run-time application migration between hosts, with a particular focus on edge computing and cloud infrastructures. The paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-25 Aleksandar Tošić

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…

Programming Languages · Computer Science 2023-11-15 Germán Vidal

stdchk: A Checkpoint Storage System for Desktop Grid Computing

Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-10 Samer Al Kiswany, Matei Ripeanu, Sudharshan S. Vazhkudai, Abdullah Gharaibeh

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

Analysis of Recent Checkpointing Techniques for Mobile Computing Systems

Recovery from transient failures is one of the prime issues in the context of distributed systems. These systems demand to have transparent yet efficient techniques to achieve the same. Checkpoint is defined as a designated place in a…

Networking and Internet Architecture · Computer Science 2011-09-01 Ruchi Tuli, Parveen Kumar

A Generic Checkpoint-Restart Mechanism for Virtual Machines

It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each…

Operating Systems · Computer Science 2012-12-11 Rohan Garg, Komal Sodha, Gene Cooperman

Checkpoint, Restore, and Live Migration for Science Platforms

We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing -- the ability to freeze and suspend to disk the running state (contents of memory,…

Instrumentation and Methods for Astrophysics · Physics 2021-01-15 Mario Juric, Steven Stetzler, Colin T. Slater

JASS: A Flexible Checkpointing System for NVM-based Systems

NVM-based systems are naturally fit candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work in especially an HPC…

Hardware Architecture · Computer Science 2023-01-30 Akshin Singh, Smruti R. Sarangi

CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications

Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-09 Fabio Andrijauskas, Igor Sfiligoi, Diego Davila, Aashay Arora, Jonathan Guiang, Brian Bockelman, Greg Thain, Frank Wurthwein

CheckSoft: A Scalable Event-Driven Software Architecture for Keeping Track of People and Things in People-Centric Spaces

We present CheckSoft, a scalable event-driven software architecture for keeping track of people-object interactions in people-centric applications such as airport checkpoint security areas, automated retail stores, smart libraries, and so…

Software Engineering · Computer Science 2021-02-23 Rohan Sarkar, Avinash C. Kak

CCNCheck: Enabling Checkpointed Distributed Applications in Content Centric Networks

We consider the problem of checkpointing a distributed application efficiently in Content Centric Networks so that it can withstand transient failures. We present CCNCheck, a system which enables a sender optimized way of checkpointing…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Nitinder Mohan, Pushpendra Singh

CHEX: Multiversion Replay with Ordered Checkpoints

In scientific computing and data science disciplines, it is often necessary to share application workflows and repeat results. Current tools containerize application workflows, and share the resulting container for repeating results. These…

Databases · Computer Science 2022-02-18 Naga Nithin Manne, Shilvi Satpati, Tanu Malik, Amitabha Bagchi, Ashish Gehani, Amitabh Chaudhary

Determination of Checkpointing Intervals for Malleable Applications

Selecting optimal intervals of checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-02 K. Raghavendra, Sathish S Vadhiyar

CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-25 Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

Towards Aggregated Asynchronous Checkpointing

High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-07 Mikaila J. Gossman, Bogdan Nicolae, Jon C. Calhoun, Franck Cappello, Melissa C. Smith

Practical Run-time Checking via Unobtrusive Property Caching

The use of annotations, referred to as assertions or contracts, to describe program properties for which run-time tests are to be generated, has become frequent in dynamic programming languages. However, the frameworks proposed to support…

Programming Languages · Computer Science 2020-02-19 Nataliia Stulova, José F. Morales, Manuel V. Hermenegildo

Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances

Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-07 Ashley Tung, Haiyan Wang, Yue Li, Zhong Wang, Jingchao Sun

Scrutinizing Variables for Checkpoint Using Automatic Differentiation

Checkpoint/Restart (C/R) saves the running state of the programs periodically, which consumes considerable system resources. We observe that not every piece of data is involved in the computation in typical HPC applications; such unused…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-19 Xin Huang, Weiping Zhang, Shiman Meng, Wubiao Xu, Xiang Fu, Luanzheng Guo, Kento Sato

Detecting Fault Injection Attacks with Runtime Verification

Fault injections are increasingly used to attack/test secure applications. In this paper, we define formal models of runtime monitors that can detect fault injections that result in test inversion attacks and arbitrary jumps in the control…

Cryptography and Security · Computer Science 2019-09-23 Ali Kassem, Yliès Falcone

A High-performance Real-time Container File Monitoring Approach Based on Virtual Machine Introspection

As cloud computing continues to advance and become an integral part of modern IT infrastructure, container security has emerged as a critical factor in ensuring the smooth operation of cloud-native applications. An attacker can attack the…

Cryptography and Security · Computer Science 2025-09-22 Kai Tan, Dongyang Zhan, Lin Ye, Hongli Zhang, Binxing Fang, Zhihong Tian