Related papers: stdchk: A Checkpoint Storage System for Desktop Gr…

200 papers

Grid computing is a collection of computer resources that are gathered together from various areas to give computational resources such as storage, data or application services. This is to permit clients to access this huge measure of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Garba Aliyu, Kana A. F. D., Abdullahi Mohammed, Idris Abdulmumin, Shehu Adamu, Fatsuma Jauro

High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-13 Kai Keller, Leonardo Bautista Gomez

CheckSync provides applications with high availability via runtime-integrated checkpointing. This allows CheckSync to take checkpoints of a process running in a memory-managed language (Go, for now), which can be resumed on another machine…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-08 Nicolaas Kaashoek, Robert Morris

Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-18 Zaoxing Liu, Zhihao Bai, Zhenming Liu, Xiaozhou Li, Changhoon Kim, Vladimir Braverman, Xin Jin, Ion Stoica

Grid computing is a type of parallel and distributed system designed to provide reliable access to data and computational resources in wide area networks. These resources are distributed in different geographical locations, however…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-09-27 Sheida Dayyani, Mohammad Reza Khayyambashi

One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-02 Marcos Maroñas, Sergi Mateo, Kai Keller, Leonardo Bautista-Gomez, Eduard Ayguadé, Vicenç Beltran

The single-chip crosspoint-queued (CQ) switch is a compact switching architecture that has all its buffers placed at the crosspoints of input and output lines. Scheduling is also performed inside the switching core, and does not rely on…

Networking and Internet Architecture · Computer Science 2014-03-11 Zizhong Cao, Shivendra S. Panwar

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-08 Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Amortizing…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-30 Ismail Akturk, Ulya R. Karpuzcu

Caching is crucial for enabling high-throughput networks for data intensive applications. Traditional caching technology relies on DRAM, as it can transfer data at a high rate. However, DRAM capacity is subject to contention by most system…

Networking and Internet Architecture · Computer Science 2023-10-12 Faruk Volkan Mutlu, Edmund Yeh

Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, it is challenging to run long-running jobs on spot instances because they are subject to unpredictable evictions. Here, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-07 Ashley Tung, Haiyan Wang, Yue Li, Zhong Wang, Jingchao Sun

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-30 Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello

The storage stack in the traditional operating system is primarily optimized towards improving CPU utilization and hiding the long I/O latency imposed by slow I/O devices such as hard disk drives (HDDs). However, the emerging…

Operating Systems · Computer Science 2023-06-21 Junzhe Li, Xiurui Pan, Shushu Yi, Jie Zhang

We consider the problem of checkpointing a distributed application efficiently in Content Centric Networks so that it can withstand transient failures. We present CCNCheck, a system which enables a sender optimized way of checkpointing…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-02 Nitinder Mohan, Pushpendra Singh

State-of-the-art stream processing platforms make use of checkpointing to support fault tolerance, where a "checkpoint tuple" flows through the topology to all operators, indicating a checkpoint and triggering a checkpoint operation. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-17 Sachini Jayasekara, Aaron Harwood, Shanika Karunasekera

The success of Google's Pregel framework in distributed graph processing has inspired a surging interest in developing Pregel-like platforms featuring a user-friendly "think like a vertex" programming model. Existing Pregel-like systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-26 Da Yan, James Cheng, Fan Yang

Although every individual invented storage technology made a big step towards perfection, none of them is spotless. Different data store essentials such as performance, availability, and recovery requirements have not met together in a…

Hardware Architecture · Computer Science 2019-04-29 Morteza Hoseinzadeh

The growing demand for efficient cloud storage solutions has led to the widespread adoption of Solid-State Drives (SSDs) for caching in cloud block storage systems. The management of data writes to SSD caches plays a crucial role in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-30 Chiyu Cheng, Chang Zhou, Yang Zhao, Jin Cao

This paper presents an empirical study on the feasibility of using Checkpoint/Restore In Userspace (CRIU) for run-time application migration between hosts, with a particular focus on edge computing and cloud infrastructures. The paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-25 Aleksandar Tošić

The state of the art in Grid style data management is to achieve increased resilience of data via multiple complete replicas of data files across multiple storage endpoints. While this is effective, it is not the most space-efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-01-20 Samuel Cadellin Skipsey, Paulin Todev, David Britton, David Crooks, Gareth Roy