Related papers: Run-time application migration using checkpoint/re…

CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications

Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-09 Fabio Andrijauskas, Igor Sfiligoi, Diego Davila, Aashay Arora, Jonathan Guiang, Brian Bockelman, Greg Thain, Frank Wurthwein

Checkpointing and Migration of IoT Edge Functions

The serverless and functions as a service (FaaS) paradigms are currently trending among cloud providers and are now increasingly being applied to the network edge, and to the Internet of Things (IoT) devices. The benefits include reduced…

Networking and Internet Architecture · Computer Science 2021-03-23 Pekka Karhula, Jan Janak, Henning Schulzrinne

Checkpoint, Restore, and Live Migration for Science Platforms

We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing -- the ability to freeze and suspend to disk the running state (contents of memory,…

Instrumentation and Methods for Astrophysics · Physics 2021-01-15 Mario Juric, Steven Stetzler, Colin T. Slater

CheckSync: Using Runtime-Integrated Checkpoints to Achieve High Availability

CheckSync provides applications with high availability via runtime-integrated checkpointing. This allows CheckSync to take checkpoints of a process running in a memory-managed language (Go, for now), which can be resumed on another machine…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-08 Nicolaas Kaashoek, Robert Morris

Performance Evaluation of Checkpoint/Restart Techniques

Distributed applications running on a large cluster environment, such as the cloud instances will have shorter execution time. However, the application might suffer from sudden termination due to unpredicted computing node failures, thus…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-30 Basma Abdel Azeem, Manal Helal

Performance Characterization of Containers in Edge Computing

Edge computing addresses critical limitations of cloud computing such as high latency and network congestion by decentralizing processing from cloud to the edge. However, the need for software replication across heterogeneous edge devices…

Performance · Computer Science 2025-05-09 Ragini Gupta, Klara Nahrstedt

CTR: Checkpoint, Transfer, and Restore for Secure Enclaves

Hardware-based Trusted Execution Environments (TEEs) are becoming increasingly prevalent in cloud computing, forming the basis for confidential computing. However, the security goals of TEEs sometimes conflict with existing cloud…

Cryptography and Security · Computer Science 2022-06-01 Yoshimichi Nakatsuka, Ercan Ozturk, Alex Shamis, Andrew Paverd, Peter Pietzuch

Extending the OpenCHK Model with Advanced Checkpoint Features

One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-02 Marcos Maroñas, Sergi Mateo, Kai Keller, Leonardo Bautista-Gomez, Eduard Ayguadé, Vicenç Beltran

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

stdchk: A Checkpoint Storage System for Desktop Grid Computing

Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-10 Samer Al Kiswany, Matei Ripeanu, Sudharshan S. Vazhkudai, Abdullah Gharaibeh

DEEP: Edge-based Dataflow Processing with Hybrid Docker Hub and Regional Registries

Reducing energy consumption is essential to lessen greenhouse gas emissions, conserve natural resources, and help mitigate the impacts of climate change. In this direction, edge computing, a complementary technology to cloud computing,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Narges Mehran, Zahra Najafabadi Samani, Reza Farahani, Josef Hammer, Dragi Kimovski

CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-25 Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

Coordinated Container Migration and Base Station Handover in Mobile Edge Computing

Offloading computationally intensive tasks from mobile users (MUs) to a virtualized environment such as containers on a nearby edge server, can significantly reduce processing time and hence end-to-end (E2E) delay. However, when users are…

Networking and Internet Architecture · Computer Science 2020-09-15 Mao V. Ngo, Tie Luo, Hieu T. Hoang, Tony Q. S. Quek

Container Resource Allocation versus Performance of Data-intensive Applications on Different Cloud Servers

In recent years, data-intensive applications have been increasingly deployed on cloud systems. Such applications utilize significant compute, memory, and I/O resources to process large volumes of data. Optimizing the performance and…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-15 Qing Wang, Snigdhaswin Kar, Prabodh Mishra, Caleb Linduff, Ryan Izard, Khayam Anjam, Geddings Barrineau, Junaid Zulfiqar, Kuang-Ching Wang

Profiling checkpointing schedules in adjoint ST-AD

Checkpointing is a cornerstone of data-flow reversal in adjoint algorithmic differentiation. Checkpointing is a storage/recomputation trade-off that can be applied at different levels, one of which being the call tree. We are looking for…

Computation and Language · Computer Science 2024-09-13 Laurent Hascoët, Jean-Luc Bouchot, Shreyas Sunil Gaikwad, Sri Hari Krishna Narayanan, Jan Hückelheim

An Adaptive Checkpointing Scheme for Peer-to-Peer Based Volunteer Computing Work Flows

Volunteer Computing, sometimes called Public Resource Computing, is an emerging computational model that is very suitable for work-pooled parallel processing. As more complex grid applications make use of work flows in their design and…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-11-27 Lei Ni, Aaron Harwood

CReIS: Computation Reuse through Image Similarity in ICN-Based Edge Computing

At the edge, there is a high level of similarity in computing. One approach that has been proposed to enhance the efficiency of edge computing is computation reuse, which eliminates redundant computations. Edge computing is integrated with…

Networking and Internet Architecture · Computer Science 2025-02-05 Atiyeh Javaheri, Ali Bohlooli, Kamal Jamshidi

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…

Programming Languages · Computer Science 2023-11-15 Germán Vidal

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-08 Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

Quantifying Daily Evolution of Mobile Software Based on Memory Allocator Churn

The pace and volume of code churn necessary to evolve modern software systems present challenges for analyzing the performance impact of any set of code changes. Traditional methods used in performance analysis rely on extensive data…

Software Engineering · Computer Science 2022-05-09 Gunnar Kudrjavets, Jeff Thomas, Aditya Kumar, Nachiappan Nagappan, Ayushi Rastogi