Related papers: Towards Distributed Software Resilience in Asynchr…

200 papers

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will invariably increase. Therefore, designing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-16 Nikunj Gupta, Jackson R. Mayo, Adrian S. Lemoine, Hartmut Kaiser

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-06-20 George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

Faults in high-performance systems are expected to be very large in the current exascale computing era. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Sarthak Joshi, Sathish Vadhiyar

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently, we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-02 Luanzheng Guo, Hanlin He, Dong Li

Fault tolerance is essential for building reliable services; however, it comes at the price of redundancy, mainly the "replication factor" and "diversity". With the increasing reliance on Internet-based services, more machines (mainly…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Ali Shoker

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource---one that is, however, also unreliable, heterogeneous, and opportunistic. This type of resource has been used…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Adriana Iamnitchi, Ian Foster

OpenMP has been the de facto standard for single node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-20 Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser

This research paper investigates how machine learning-driven data replication strategies can enhance fault tolerance in large-scale distributed systems. Traditional replication methods, which rely on static configurations, often struggle to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Almond Kiruthu Murimi

This paper presents a powerful automated framework for making complex systems resilient under failures, by optimized adaptive distribution and replication of interdependent software components across heterogeneous hardware components with…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-13 Scott D. Stoller, Balaji Jayasankar, Yanhong A. Liu

The applications that are deployed in the cloud to provide services to the users encompass a large number of interconnected dependent cloud components. Multiple identical components are scheduled to run concurrently in order to handle…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-12 Chinmaya Kumar Dehury, Prasan Kumar Sahoo, Bharadwaj Veeravalli

As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-26 Sarthak Joshi, Sathish Vadhiyar

Asynchronous Many-task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-09 Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, Hartmut Kaiser

Scientific workflows have been predominantly used for complex and large scale data analysis and scientific computation/automation and the need for robust workflow scheduling techniques has grown considerably. But, most of the existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-04 S. Jaya Nirmala, Amrith Rajagopal Setlur, Har Simrat Singh, Sudhanshu Khoriya

Parallel algorithms relying on synchronous parallelization libraries often experience adverse performance due to global synchronization barriers. Asynchronous many-task runtimes offer task futurization capabilities that minimize or remove…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-05 Alexander Strack, Christopher Taylor, Patrick Diehl, Dirk Pflüger

One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-07 Da Wang, Gauri Joshi, Gregory Wornell

Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-31 Sicheng Zhou, Zhuozhao Li, Valérie Hayot-Sasson, Haochen Pan, Maxime Gonthier, J. Gregory Pauloski, Ryan Chard, Kyle Chard, Ian Foster

In large distributed systems, failures are a daily event occurring frequently, especially with growing numbers of computation tasks and locations on which they are deployed. The advantage of representing an application with a workflow is…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Alberto Mulone, Doriana Medić, Marco Aldinucci

Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-18 David Álvarez, Kevin Sala, Marcos Maroñas, Aleix Roca, Vicenç Beltran

Serial-parallel redundancy is a reliable way to ensure service and systems will be available in cloud computing. That method involves making copies of the same system or program, with only one remaining active. When an error occurs, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-08 Gutha Jaya Krishna