Related papers: Recomputation Enabled Efficient Checkpointing

Checkpoint and Restart: An Energy Consumption Characterization in Clusters

The fault tolerance method currently used in High Performance Computing (HPC) is the rollback-recovery method by using checkpoints. This, like any other fault tolerance method, adds an additional energy consumption to that of the execution…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-05 · Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque

Improving Grid Computing Performance by Optimally Reducing Checkpointing Effect

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-06-05 · Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

Improving Performance of Iterative Methods by Lossy Checkponting

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-05-30 · Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems

Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory…

Computational Engineering, Finance, and Science · Computer Science · 2021-09-21 · Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, Fabio Luporini, Paul Hovland, Gerard Gorman

Energy-aware checkpointing of divisible tasks with soft or hard deadlines

In this paper, we aim at minimizing the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided through checkpointing. We discuss several variants of this multi-criteria…

Data Structures and Algorithms · Computer Science · 2013-02-18 · Guillaume Aupy, Anne Benoit, Rami Melhem, Paul Renaud-Goud, Yves Robert

An Efficient Fault Tolerant Workflow Scheduling Approach using Replication Heuristics and Checkpointing in the Cloud

Scientific workflows have been predominantly used for complex and large scale data analysis and scientific computation/automation and the need for robust workflow scheduling techniques has grown considerably. But, most of the existing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-11-04 · S. Jaya Nirmala, Amrith Rajagopal Setlur, Har Simrat Singh, Sudhanshu Khoriya

Rapid Recovery of Program Execution Under Power Failures for Embedded Systems with NVM

After power is switched on, recovering the interrupted program from the initial state can cause negative impact. Some programs are even unrecoverable. To rapid recovery of program execution under power failures, the execution states of…

Operating Systems · Computer Science · 2022-09-20 · Min Jia, Edwin Hsing.-M. Sha, Qingfeng Zhuge, Rui Xu, Shouzhen Gu

Asymptotic efficiency of restart and checkpointing

Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, it is restarted from the beginning. Under checkpointing, the task…

Probability · Mathematics · 2018-05-15 · Antonio Sodre

JASS: A Flexible Checkpointing System for NVM-based Systems

NVM-based systems are naturally fit candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work in especially an HPC…

Hardware Architecture · Computer Science · 2023-01-30 · Akshin Singh, Smruti R. Sarangi

Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption

Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Those methods, despite their 'advantages', have some known serious disadvantages including unoptimized utilization of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-02-26 · Kfir Zvi, Gal Oren

Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory

Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths but require high memory-processor bandwidth to train. Checkpointing techniques can…

Neural and Evolutionary Computing · Computer Science · 2024-12-17 · Wadjih Bencheikh, Jan Finkbeiner, Emre Neftci

Smart Resetting: An Energy-Efficient Strategy for Stochastic Search Processes

Stochastic resetting, a method for accelerating target search in random processes, often incurs temporal and energetic costs. For a diffusing particle, a lower bound exists for the energetic cost of reaching the target, which is attained at…

Statistical Mechanics · Physics · 2024-09-17 · Ofir Tal-Friedman, Tommer D. Keidar, Shlomi Reuveni, Yael Roichman

An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling

This paper is dedicated to an efficient compression of weights and optimizer states (called checkpoints) obtained at different stages during a neural network training process. First, we propose a prediction-based compression approach, where…

Machine Learning · Computer Science · 2025-06-16 · Yuriy Kim, Evgeny Belyaev

Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication

This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-21 · Ankit Bhardwaj, Weiyang Wang, Jeremy Carin, Adam Belay, Manya Ghobadi

Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-11-15 · Marina Moran, Javier Balladini, Dolores Rexachs, Enzo Rucci

Checkpointing and Localized Recovery for Nested Fork-Join Programs

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-01 · Claudia Fohry

From Reversible Computation to Checkpoint-Based Rollback Recovery for Message-Passing Concurrent Programs

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes to take snapshots of…

Programming Languages · Computer Science · 2023-11-15 · Germán Vidal

A Framework for Checkpointing and Recovery of Hierarchical Cyber-Physical Systems

This paper tackles the problem of making complex resource-constrained cyber-physical systems (CPS) resilient to sensor anomalies. In particular, we present a framework for checkpointing and roll-forward recovery of state-estimates in…

Systems and Control · Electrical Eng. & Systems · 2023-01-02 · Kaustubh Sridhar, Radoslav Ivanov, Vuk Lesi, Marcio Juliato, Manoj Sastry, Lily Yang, James Weimer, Oleg Sokolsky, Insup Lee

Enabling Failure-resilient Intermittent Systems Without Runtime Checkpointing

Self-powered intermittent systems typically adopt runtime checkpointing as a means to accumulate computation progress across power cycles and recover system status from power failures. However, existing approaches based on the checkpointing…

Operating Systems · Computer Science · 2019-10-14 · Wei-Ming Chen, Tei-Wei Kuo, Pi-Cheng Hsiu

stdchk: A Checkpoint Storage System for Desktop Grid Computing

Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2011-11-10 · Samer Al Kiswany, Matei Ripeanu, Sudharshan S. Vazhkudai, Abdullah Gharaibeh