Related papers: Run-time application migration using checkpoint/re…
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks,…
The serverless and functions as a service (FaaS) paradigms are currently trending among cloud providers and are now increasingly being applied to the network edge and to Internet of Things (IoT) devices. The benefits include reduced…
We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing -- the ability to freeze and suspend to disk the running state (contents of memory,…
CheckSync provides applications with high availability via runtime-integrated checkpointing. This allows CheckSync to take checkpoints of a process running in a memory-managed language (Go, for now), which can be resumed on another machine…
Distributed applications running in a large cluster environment, such as cloud instances, have shorter execution times. However, the application might suffer from sudden termination due to unpredicted computing node failures, thus…
Edge computing addresses critical limitations of cloud computing such as high latency and network congestion by decentralizing processing from cloud to the edge. However, the need for software replication across heterogeneous edge devices…
Hardware-based Trusted Execution Environments (TEEs) are becoming increasingly prevalent in cloud computing, forming the basis for confidential computing. However, the security goals of TEEs sometimes conflict with existing cloud…
One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance.…
Grid computing is a collection of computer resources gathered from various locations to provide computational resources such as storage, data, or application services. This permits clients to access this huge amount of…
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that a dedicated checkpoint storage system, optimized to operate in…
Reducing energy consumption is essential to lessen greenhouse gas emissions, conserve natural resources, and help mitigate the impacts of climate change. In this direction, edge computing, a complementary technology to cloud computing,…
Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant…
Offloading computationally intensive tasks from mobile users (MUs) to a virtualized environment, such as containers on a nearby edge server, can significantly reduce processing time and hence end-to-end (E2E) delay. However, when users are…
In recent years, data-intensive applications have been increasingly deployed on cloud systems. Such applications utilize significant compute, memory, and I/O resources to process large volumes of data. Optimizing the performance and…
Checkpointing is a cornerstone of data-flow reversal in adjoint algorithmic differentiation. Checkpointing is a storage/recomputation trade-off that can be applied at different levels, one of which is the call tree. We are looking for…
Volunteer Computing, sometimes called Public Resource Computing, is an emerging computational model that is very suitable for work-pooled parallel processing. As more complex grid applications make use of workflows in their design and…
At the edge, there is a high level of similarity in computing. One approach that has been proposed to enhance the efficiency of edge computing is computation reuse, which eliminates redundant computations. Edge computing is integrated with…
The reliability of concurrent and distributed systems often depends on some well-known techniques for fault tolerance. One such technique is based on checkpointing and rollback recovery. Checkpointing involves processes taking snapshots of…
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and…
The pace and volume of code churn necessary to evolve modern software systems present challenges for analyzing the performance impact of any set of code changes. Traditional methods used in performance analysis rely on extensive data…