Related papers: Optimal Multi-Level Interval-based Checkpointing f…
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the…
Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…
Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…
Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing…
Volunteer computing is being used successfully for large scale scientific computations. This research is in the context of Volpex, a programming framework that supports communicating parallel processes in a volunteer environment. Redundancy…
Selecting optimal checkpointing intervals for an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for…
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with…
Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for…
The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is…
Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing…
This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We…
In this paper, we study a fixed-confidence, fixed-tolerance formulation of a class of stochastic bi-level optimization problems, where the upper-level problem selects from a finite set of systems based on a performance metric, and the…
Parallel computing is very important to accelerate the performance of software systems. Additionally, considering that a recurring challenge is to process high data volumes continuously, stream processing emerged as a paradigm and software…
The aim of this study is to extend the scope and applicability of the level-crossing method to discrete-time stochastic processes and generalize it to enable us to study multiple discrete-time stochastic processes. In previous versions of…
The number of faults in high-performance systems is expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…
We present a simplified derivation of the optimal checkpoint interval in Young_1974 [1]. The derivation in [1] minimizes the total lost time as its objective function. Lost time is a function of…
Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered…
Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications. However, such a pattern frequently leads to I/O bottlenecks that cause poor scalability and performance. As…
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g.,…
In this paper, we propose a new threshold-kernel jump-detection method for jump-diffusion processes, which iteratively applies thresholding and kernel methods in an approximately optimal way to achieve improved finite-sample performance. We…
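Several entries above concern choosing a checkpoint interval, and one revisits the derivation in Young_1974 [1]. As a rough illustration only, the sketch below uses the widely quoted first-order result, interval = sqrt(2 × checkpoint cost × mean time between failures), together with a simple lost-time model (checkpoint overhead plus expected rework). The function names, the parameter names, and the lost-time model are assumptions made for this example; they are not taken from any of the listed papers and ignore restart cost and failures during checkpointing.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order optimal checkpoint interval, sqrt(2 * C * M).

    checkpoint_cost (C) and mtbf (M) must use the same time unit.
    Hypothetical helper for illustration only.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def lost_time_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Assumed first-order lost time per unit of useful work.

    C / T accounts for time spent writing checkpoints;
    T / (2 * M) accounts for the expected work redone after a failure,
    assuming a failure lands on average halfway through an interval.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 100.0  # illustrative values, same time unit
    best = young_interval(cost, mtbf)
    print(f"interval: {best:.2f}")
    for candidate in (best / 2, best, best * 2):
        waste = lost_time_fraction(candidate, cost, mtbf)
        print(f"T={candidate:.2f} lost fraction={waste:.3f}")
```

Under this assumed model, checkpointing either half as often or twice as often as the computed interval increases the lost-time fraction, which is the trade-off the interval-selection papers above study with more detailed models.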