Related papers: A Pattern Language for High-Performance Computing …

Pattern-based Modeling of High-Performance Computing Resilience

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-10 Saurabh Hukerikar, Christian Engelmann

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. While the HPC community has developed various resilience solutions, the solution space remains fragmented. There are no formal methods and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-01 Saurabh Hukerikar, Christian Engelmann

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-30 Saurabh Hukerikar, Christian Engelmann

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors, and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-23 Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

Rolex: Resilience-Oriented Language Extensions for Extreme-Scale Systems

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-05-24 Saurabh Hukerikar, Robert F. Lucas

Conceptual and Technical Challenges for High Performance Computing

High Performance Computing (HPC) aims at providing reasonably fast computing solutions to scientific and real life problems. The advent of multicore architectures is noticeable in the HPC history, because it has brought the underlying…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 Claude Tadonki

LM4HPC: Towards Effective Language Model Application in High-Performance Computing

In recent years, language models (LMs), such as GPT-4, have been widely used in multiple domains, including natural language processing, visualization, and so on. However, applying them for analyzing and optimizing high-performance…

Machine Learning · Computer Science 2023-11-28 Le Chen, Pei-Hung Lin, Tristan Vanderbruggen, Chunhua Liao, Murali Emani, Bronis de Supinski

Application-Level Resilience Modeling for HPC Fault Tolerance

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently, we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-02 Luanzheng Guo, Hanlin He, Dong Li

FlipTracker: Understanding Natural Error Resilience in HPC Applications

As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-06 Luanzheng Guo, Dong Li, Ignacio Laguna, Martin Schulz

Towards Adaptive Resilience in High Performance Computing

Failure rates in high performance computers rapidly increase due to the growth in system size and complexity. Hence, failures became the norm rather than the exception. Different approaches on high performance computing (HPC) systems have…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-15 Siavash Ghiasvand, Florina M. Ciorba

Scalable Systems and Software Architectures for High-Performance Computing on cloud platforms

High-performance computing (HPC) is essential for tackling complex computational problems across various domains. As the scale and complexity of HPC applications continue to grow, the need for scalable systems and software architectures…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-21 Risshab Srinivas Ramesh

Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort.…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-02-10 Michel Steuwer, Christian Fensch, Christophe Dubach

HPC-Coder: Modeling Parallel Programs using Large Language Models

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Comprehensive Performance Modeling and System Design Insights for Foundation Models

Generative AI, in particular large transformer models, are increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer…

Machine Learning · Computer Science 2024-10-02 Shashank Subramanian, Ermal Rrapaj, Peter Harrington, Smeet Chheda, Steven Farrell, Brian Austin, Samuel Williams, Nicholas Wright, Wahid Bhimji

P2P-PL: A Pattern Language to Design Efficient and Robust Peer-to-Peer Systems

To design peer-to-peer (P2P) software systems is a challenging task, because of their highly decentralized nature, which may cause unexpected emergent global behaviors. The last fifteen years have seen many P2P applications to come out and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-14 Michele Amoretti, Francesco Zanichelli

HPC-GPT: Integrating Large Language Model for High-Performance Computing

Large Language Models (LLMs), including the LLaMA model, have exhibited their efficacy across various general-domain natural language processing (NLP) tasks. However, their performance in high-performance computing (HPC) domain tasks has…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Xianzhong Ding, Le Chen, Murali Emani, Chunhua Liao, Pei-Hung Lin, Tristan Vanderbruggen, Zhen Xie, Alberto E. Cerpa, Wan Du

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-06-22 Erlin Yao, Mingyu Chen, Rui Wang, Wenli Zhang, Guangming Tan

The Technologies Required for Fusing HPC and Real-Time Data to Support Urgent Computing

The use of High Performance Computing (HPC) to complement urgent decision making in the event of disasters is an important future potential use of supercomputers. However, the usage modes involved are rather different from how HPC has been…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-06 Gordon Gibb, Rupert Nash, Nick Brown, Bianca Prodan

Online Fault Classification in HPC Systems through Machine Learning

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-12 Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Michael Treaster