Related papers: Learning Everywhere: Pervasive Machine Learning fo…
We recently outlined the vision of "Learning Everywhere" which captures the possibility and impact of how learning methods and traditional HPC methods can be coupled together. A primary driver of such coupling is the promise that Machine…
Traditional simulations on High-Performance Computing (HPC) systems typically involve modeling very large domains and/or very complex equations. HPC systems allow running large models, but limits in performance increase that have become…
Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are…
We explore the idea of integrating machine learning (ML) with high performance computing (HPC)-driven simulations to address challenges in using simulations to teach computational science and engineering courses. We demonstrate that a ML…
Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…
The growing demands of the worldwide IT infrastructure stress the need for reduced power consumption, which is addressed in so-called transprecision computing by improving energy efficiency at the expense of precision. For example, reducing…
With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows,…
Growing interest in Artificial Intelligence (AI) has resulted in a surge in demand for faster methods of Machine Learning (ML) model training and inference. This demand for speed has prompted the use of high performance computing (HPC)…
As a broader set of applications from simulations to data analysis and machine learning require more parallel computational capability, the demand for interactive and urgent high performance computing (HPC) continues to increase. This paper…
Hardware support for high-performance computing (HPC) has so far been subject to significant advances. The pervasiveness of HPC systems, mainly made up with parallel computing units, makes it crucial to spread and vivify effective HPC…
High-performance computing (HPC) has evolved over decades through multiple architectural transitions, from vector supercomputers to massively parallel CPU clusters and GPU-accelerated systems, continuously expanding the frontier of…
Increasingly, scientific discovery requires sophisticated and scalable workflows. Workflows have become the ``new applications,'' wherein multi-scale computing campaigns comprise multiple and heterogeneous executable tasks. In particular,…
High Performance Distributed Computing is essential to boost scientific progress in many areas of science and to efficiently deploy a number of complex scientific applications. These applications have different characteristics that require…
Recently, language models (LMs), especially large language models (LLMs), have revolutionized the field of deep learning. Both encoder-decoder models and prompt-based techniques have shown immense potential for natural language processing…
Nowadays, we are to find out solutions to huge computing problems very rapidly. It brings the idea of parallel computing in which several machines or processors work cooperatively for computational tasks. In the past decades, there are a…
Can cloud computing infrastructures provide HPC-competitive performance for scientific applications broadly? Despite prolific related literature, this question remains open. Answers are crucial for designing future systems and democratizing…
Let's HPC (www.letshpc.org) is an open-access online platform to supplement conventional classroom oriented High Performance Computing (HPC) and Parallel & Distributed Computing (PDC) education. The web based platform provides online…
High Performance Computing (HPC), Artificial Intelligence (AI)/Machine Learning (ML), and Quantum Computing (QC) and communications offer immense opportunities for innovation and impact on society. Researchers in these areas depend on…
Heterogeneous scientific workflows consist of numerous types of tasks that require executing on heterogeneous resources. Asynchronous execution of those tasks is crucial to improve resource utilization, task throughput and reduce workflows'…
In this work, system monitoring and analysis are discussed in terms of their significance and benefits for operations and research in the field of high-performance computing (HPC). HPC systems deliver unique insights to computational…