Related papers: CheckSync: Using Runtime-Integrated Checkpoints to…
This paper presents an empirical study on the feasibility of using Checkpoint/Restore In Userspace (CRIU) for run-time application migration between hosts, with a particular focus on edge computing and cloud infrastructures. The paper…
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…
Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…
Recovery from transient failures is one of the prime issues in the context of distributed systems. These systems demand to have transparent yet efficient techniques to achieve the same. Checkpoint is defined as a designated place in a…
It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each…
We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing -- the ability to freeze and suspend to disk the running state (contents of memory,…
NVM-based systems are naturally fit candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work in especially an HPC…
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks,…
We present CheckSoft, a scalable event-driven software architecture for keeping track of people-object interactions in people-centric applications such as airport checkpoint security areas, automated retail stores, smart libraries, and so…
We consider the problem of checkpointing a distributed application efficiently in Content Centric Networks so that it can withstand transient failures. We present CCNCheck, a system which enables a sender optimized way of checkpointing…
In scientific computing and data science disciplines, it is often necessary to share application workflows and repeat results. Current tools containerize application workflows, and share the resulting container for repeating results. These…
Selecting optimal intervals of checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for…
Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant…
High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application…
The use of annotations, referred to as assertions or contracts, to describe program properties for which run-time tests are to be generated, has become frequent in dynamic programing languages. However, the frameworks proposed to support…
Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we…
Checkpoint/Restart (C/R) saves the running state of the programs periodically, which consumes considerable system resources. We observe that not every piece of data is involved in the computation in typical HPC applications; such unused…
Fault injections are increasingly used to attack/test secure applications. In this paper, we define formal models of runtime monitors that can detect fault injections that result in test inversion attacks and arbitrary jumps in the control…
As cloud computing continues to advance and become an integral part of modern IT infrastructure, container security has emerged as a critical factor in ensuring the smooth operation of cloud-native applications. An attacker can attack the…