Related papers: Recomputation Enabled Efficient Checkpointing

200 papers

The fault tolerance method currently used in High Performance Computing (HPC) is rollback recovery based on checkpoints. Like any other fault tolerance method, it adds energy consumption on top of that of the execution…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-05 Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-30 Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory…

Computational Engineering, Finance, and Science · Computer Science 2021-09-21 Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, Fabio Luporini, Paul Hovland, Gerard Gorman

In this paper, we aim at minimizing the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided through checkpointing. We discuss several variants of this multi-criteria…

Data Structures and Algorithms · Computer Science 2013-02-18 Guillaume Aupy, Anne Benoit, Rami Melhem, Paul Renaud-Goud, Yves Robert

Scientific workflows have been predominantly used for complex and large scale data analysis and scientific computation/automation and the need for robust workflow scheduling techniques has grown considerably. But, most of the existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-04 S. Jaya Nirmala, Amrith Rajagopal Setlur, Har Simrat Singh, Sudhanshu Khoriya

After power is switched back on, restarting an interrupted program from its initial state can have a negative impact, and some programs are unrecoverable altogether. To enable rapid recovery of program execution under power failures, the execution states of…

Operating Systems · Computer Science 2022-09-20 Min Jia, Edwin Hsing. -M. Sha, Qingfeng Zhuge, Rui Xu, Shouzhen Gu

Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, the task is restarted from the beginning. Under checkpointing, the task…

Probability · Mathematics 2018-05-15 Antonio Sodre

NVM-based systems are natural candidates for incorporating periodic checkpointing (or snapshotting). This increases the reliability of the system, makes it more immune to power failures, and reduces wasted work, especially in an HPC…

Hardware Architecture · Computer Science 2023-01-30 Akshin Singh, Smruti R. Sarangi

Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Those methods, despite their 'advantages', have some known serious disadvantages including unoptimized utilization of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-26 Kfir Zvi, Gal Oren

Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths but require high memory-processor bandwidth to train. Checkpointing techniques can…

Neural and Evolutionary Computing · Computer Science 2024-12-17 Wadjih Bencheikh, Jan Finkbeiner, Emre Neftci

Stochastic resetting, a method for accelerating target search in random processes, often incurs temporal and energetic costs. For a diffusing particle, a lower bound exists for the energetic cost of reaching the target, which is attained at…

Statistical Mechanics · Physics 2024-09-17 Ofir Tal-Friedman, Tommer D. Keidar, Shlomi Reuveni, Yael Roichman

This paper is dedicated to an efficient compression of weights and optimizer states (called checkpoints) obtained at different stages during a neural network training process. First, we propose a prediction-based compression approach, where…

Machine Learning · Computer Science 2025-06-16 Yuriy Kim, Evgeny Belyaev

This paper presents Checkmate, a system that enables per-iteration checkpointing in DNN training without any training slowdown. The traditional approach to checkpointing requires a pause in training to copy model states to a separate…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-21 Ankit Bhardwaj, Weiyang Wang, Jeremy Carin, Adam Belay, Manya Ghobadi

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-15 Marina Moran, Javier Balladini, Dolores Rexachs, Enzo Rucci

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-01 Claudia Fohry

The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes taking snapshots of…

Programming Languages · Computer Science 2023-11-15 Germán Vidal

This paper tackles the problem of making complex resource-constrained cyber-physical systems (CPS) resilient to sensor anomalies. In particular, we present a framework for checkpointing and roll-forward recovery of state-estimates in…

Systems and Control · Electrical Eng. & Systems 2023-01-02 Kaustubh Sridhar, Radoslav Ivanov, Vuk Lesi, Marcio Juliato, Manoj Sastry, Lily Yang, James Weimer, Oleg Sokolsky, Insup Lee

Self-powered intermittent systems typically adopt runtime checkpointing as a means to accumulate computation progress across power cycles and recover system status from power failures. However, existing approaches based on the checkpointing…

Operating Systems · Computer Science 2019-10-14 Wei-Ming Chen, Tei-Wei Kuo, Pi-Cheng Hsiu

Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-11-10 Samer Al Kiswany, Matei Ripeanu, Sudharshan S. Vazhkudai, Abdullah Gharaibeh