Related papers: Optimal Multi-Level Interval-based Checkpointing f…

A Utilization Model for Optimization of Checkpoint Intervals in Distributed Stream Processing Systems

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-04-21 · Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

Improving Performance of Iterative Methods by Lossy Checkpointing

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-05-30 · Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science · 2011-06-22 · Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

Checkpoint/restart approaches for a thread-based MPI runtime

Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-06-13 · Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Perache, Patrick Carribault, Julien Jaeger, Allen D. Malony, Sameer Shende

Checkpointing to minimize completion time for Inter-dependent Parallel Processes on Volunteer Grids

Volunteer computing is being used successfully for large scale scientific computations. This research is in the context of Volpex, a programming framework that supports communicating parallel processes in a volunteer environment. Redundancy…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-03-14 · Mohammad Tanvir Rahman, Hien Nguyen, Jaspal Subhlok, Gopal Pandurangan

Determination of Checkpointing Intervals for Malleable Applications

Selecting optimal intervals of checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-11-02 · K. Raghavendra, Sathish S Vadhiyar

A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations

Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-01-30 · Nils Kohl, Johannes Hötzer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Harald Köstler, Britta Nestler, Ulrich Rüde

System-level Scalable Checkpoint-Restart for Petascale Computing

Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-09-27 · Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jérôme Vienne, Gene Cooperman

GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training

The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is…

Operating Systems · Computer Science · 2025-11-11 · Keyao Zhang, Yiquan Chen, Zhuo Hu, Wenhai Lin, Jiexiong Xu, Wenzhi Chen

CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-03-21 · George Siachamis, Kyriakos Psarakis, Marios Fragkoulis, Arie van Deursen, Paris Carbone, Asterios Katsifodimos

Optimal Checkpointing Period: Time vs. Energy

This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We…

Distributed, Parallel, and Cluster Computing · Computer Science · 2013-11-01 · Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Jack Dongarra

Fixed Confidence and Fixed Tolerance Bi-level Optimization for Selecting the Best Optimized System

In this paper, we study a fixed-confidence, fixed-tolerance formulation of a class of stochastic bi-level optimization problems, where the upper-level problem selects from a finite set of systems based on a performance metric, and the…

Optimization and Control · Mathematics · 2025-01-20 · Yuhao Wang, Seong-Hee Kim, Enlu Zhou

High-level Stream Processing: A Complementary Analysis of Fault Recovery

Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-14 · Adriano Vogel, Sören Henning, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser

Patterns for the waiting time in the context of discrete-time stochastic processes

The aim of this study is to extend the scope and applicability of the level-crossing method to discrete-time stochastic processes and generalize it to enable us to study multiple discrete-time stochastic processes. In previous versions of…

Data Analysis, Statistics and Probability · Physics · 2016-09-15 · Tayeb Jamali, G. R. Jafari, S. Vasheghani Farahani

FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-04-15 · Sarthak Joshi, Sathish Vadhiyar

Optimal Checkpoint Interval with Availability as an Objective Function

We present a simplified derivation of the optimal checkpoint interval in Young (1974) [1]. The optimal checkpoint interval derivation in [1] is based on minimizing the total lost time as an objective-function. Lost time is a function of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-25 · Nirmal Raj Saxena, Saurabh Hukerikar, Mikolaj Blaz, Swapna Raj

Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-02-12 · Morgan Geldenhuys, Lauritz Thamsen, Odej Kao

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale

Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that lead to poor scalability and performance. As…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-04 · Bogdan Nicolae, Adam Moody, Gregory Kosinovsky, Kathryn Mohror, Franck Cappello

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g.,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-18 · Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

Optimal Iterative Threshold-Kernel Estimation of Jump Diffusion Processes

In this paper, we propose a new threshold-kernel jump-detection method for jump-diffusion processes, which iteratively applies thresholding and kernel methods in an approximately optimal way to achieve improved finite-sample performance. We…

Statistics Theory · Mathematics · 2020-04-07 · José E. Figueroa-López, Cheng Li, Jeffrey Nisen