Related papers: A Pattern Language for High-Performance Computing …

200 papers

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-10 Saurabh Hukerikar, Christian Engelmann

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. While the HPC community has developed various resilience solutions, the solution space remains fragmented. There are no formal methods and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-01 Saurabh Hukerikar, Christian Engelmann

In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-30 Saurabh Hukerikar, Christian Engelmann

Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-23 Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-24 Saurabh Hukerikar, Robert F. Lucas

High Performance Computing (HPC) aims at providing reasonably fast computing solutions to scientific and real life problems. The advent of multicore architectures is noticeable in the HPC history, because it has brought the underlying…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 Claude Tadonki

In recent years, language models (LMs), such as GPT-4, have been widely used in multiple domains, including natural language processing, visualization, and so on. However, applying them for analyzing and optimizing high-performance…

Machine Learning · Computer Science 2023-11-28 Le Chen, Pei-Hung Lin, Tristan Vanderbruggen, Chunhua Liao, Murali Emani, Bronis de Supinski

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently, we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-02 Luanzheng Guo, Hanlin He, Dong Li

As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-06 Luanzheng Guo, Dong Li, Ignacio Laguna, Martin Schulz

Failure rates in high performance computers rapidly increase due to the growth in system size and complexity. Hence, failures became the norm rather than the exception. Different approaches on high performance computing (HPC) systems have…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-15 Siavash Ghiasvand, Florina M. Ciorba

High-performance computing (HPC) is essential for tackling complex computational problems across various domains. As the scale and complexity of HPC applications continue to grow, the need for scalable systems and software architectures…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-21 Risshab Srinivas Ramesh

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort.…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-02-10 Michel Steuwer, Christian Fensch, Christophe Dubach

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Generative AI, in particular large transformer models, are increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer…

Designing peer-to-peer (P2P) software systems is a challenging task because of their highly decentralized nature, which may cause unexpected emergent global behaviors. The last fifteen years have seen many P2P applications come out and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-14 Michele Amoretti, Francesco Zanichelli

Large Language Models (LLMs), including the LLaMA model, have exhibited their efficacy across various general-domain natural language processing (NLP) tasks. However, their performance in high-performance computing (HPC) domain tasks has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Xianzhong Ding, Le Chen, Murali Emani, Chunhua Liao, Pei-Hung Lin, Tristan Vanderbruggen, Zhen Xie, Alberto E. Cerpa, Wan Du

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

The use of High Performance Computing (HPC) to complement urgent decision making in the event of disasters is an important future potential use of supercomputers. However, the usage modes involved are rather different from how HPC has been…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-06 Gordon Gibb, Rupert Nash, Nick Brown, Bianca Prodan

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-12 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Michael Treaster