Related papers: Optimal Checkpoint Interval with Availability as a…
State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the…
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to…
The machine learning literature contains several constructions for prediction intervals that are intuitively reasonable but ultimately ad-hoc in that they do not come with provable performance guarantees. We present methods from the…
Selecting optimal intervals of checkpointing an application is important for minimizing the run time of the application in the presence of system failures. Most of the existing efforts on checkpointing interval selection were developed for…
Tasks that require information about the world imply a trade-off between the time spent on observation and the variance of the response. In particular, fast decisions need to rely on uncertain information. However, standard estimates of…
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…
State-of-the-art stream processing platforms make use of checkpointing to support fault tolerance, where a "checkpoint tuple" flows through the topology to all operators, indicating a checkpoint and triggering a checkpoint operation. The…
Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, it is restarted from the beginning. Under checkpointing, the task…
Optimization is an important module of modern machine learning applications. Tremendous efforts have been made to accelerate optimization algorithms. A common formulation is achieving a lower loss at a given time. This enables a…
We investigate a single machine rescheduling problem that arises from an unexpected machine unavailability, after the given set of jobs has already been scheduled to minimize the total weighted completion time. Such a disruption is…
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its…
We study the problem of conformal prediction in a novel online framework that directly optimizes efficiency. In our problem, we are given a target miscoverage rate $\alpha > 0$, and a time horizon $T$. On each day $t \le T$ an algorithm…
Change point estimation in its offline version is traditionally performed by optimizing over the data set of interest, by considering each data point as the true location parameter and computing a data fit criterion. Subsequently, the data…
While the design of optimal peak-to-peak controllers/observers for linear systems is known to be a difficult problem, this problem becomes interestingly much easier in the context of interval observers because of the positive nature of the…
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its…
This paper investigates a novel offline change-point detection problem from an information-theoretic perspective. In contrast to most related works, we assume that the knowledge of the underlying pre- and post-change distributions are not…
We introduce and study two new inferential challenges associated with the sequential detection of change in a high-dimensional mean vector. First, we seek a confidence interval for the changepoint, and second, we estimate the set of indices…
This paper considers online convex optimization with time-varying constraint functions. Specifically, we have a sequence of convex objective functions $\{f_t(x)\}_{t=0}^{\infty}$ and convex constraint functions…
Inversion and PDE-constrained optimization problems often rely on solving the adjoint problem to calculate the gradient of the objec- tive function. This requires storing large amounts of intermediate data, setting a limit to the largest…
Missing data is pervasive in econometric applications, and rarely is it plausible that the data are missing (completely) at random. This paper proposes a methodology for studying the robustness of results drawn from incomplete datasets.…