Related papers: Multi-GPU Performance Optimization of a CFD Code u…

An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

This paper is focused on improving multi-GPU performance of a research CFD code on structured grids. MPI and OpenACC directives are used to scale the code up to 16 GPUs. This paper shows that using 16 P100 GPUs and 16 V100 GPUs can be…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-10 Weicheng Xue, Charles W. Jackson, Christopher J. Roy

Improving Scalability with GPU-Aware Asynchronous Tasks

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

Adaptive Multidimensional Quadrature on Multi-GPU Systems

We introduce a distributed adaptive quadrature method that formulates multidimensional integration as a hierarchical domain decomposition problem on multi-GPU architectures. The integration domain is recursively partitioned into subdomains…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-04 Melanie Tonarelli, Simone Riva, Pietro Benedusi, Fabrizio Ferrandi, Rolf Krause

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses…

Fluid Dynamics · Physics 2025-05-20 Benjamin Wilfong, Anand Radhakrishnan, Henry A. Le Berre, Steve Abbott, Reuben D. Budiardja, Spencer H. Bryngelson

Design and optimization of a portable LQCD Monte Carlo code using OpenACC

The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting…

High Energy Physics - Lattice · Physics 2017-05-09 Claudio Bonati, Enrico Calore, Simone Coscetti, Massimo D'Elia, Michele Mesiti, Francesco Negro, Sebastiano Fabio Schifano, Giorgio Silvi, Raffaele Tripiccione

Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA

This paper presents a comprehensive comparison of three dominant parallel programming models in High Performance Computing (HPC): Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Compute Unified Device Architecture…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Nizar ALHafez, Ahmad Kurdi

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

The rapid development in computing technology has paved the way for directive-based programming models towards a principal role in maintaining software portability of performance-critical applications. Efforts on such models involve a least…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-28 Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Improved Multi-GPU parallelization of a Lagrangian Transport Model

This report highlights our work on improving GPU parallelization by supporting compute nodes with multiple GPUs. However, since the default support for multi-GPUs in OpenACC is limited[6], the current implementation allows each MPI process…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-24 Saheed Bolarinwa

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

Massive parallelization and performance enhancement of an immersed boundary method based unsteady flow solver

High-fidelity simulations of unsteady fluid flow are now possible with advancements in high-performance computing hardware and software frameworks. Since computational fluid dynamics (CFD) computations are dominated by linear algebraic…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-28 Rahul Sundar, Dipanjan Majumdar, Chhote Lal Shah, Sunetra Sarkar

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Hardware-aware design and optimization is crucial in exploiting emerging architectures for PDE-based computational fluid dynamics applications. In this work, we study optimizations aimed at acceleration of OpenFOAM-based applications on…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-29 Amani AlOnazi, David Keyes, Alexey Lastovetsky, Vladimir Rychkov

Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-02 E. Calore, A. Gabbana, J. Kraus, S. F. Schifano, R. Tripiccione

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize the latest generations of GPGPUs on relevant applications. In this paper, we present results and share…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-13 Baodi Shan, Mauricio Araya-Polo

Method for portable, scalable, and performant GPU-accelerated simulation of multiphase compressible flow

Multiphase compressible flows are often characterized by a broad range of space and time scales. Because they entail large grids and small time steps, simulations of these flows on CPU-based clusters can take several wall-clock days.…

Fluid Dynamics · Physics 2024-05-20 Anand Radhakrishnan, Henry Le Berre, Benjamin Wilfong, Jean-Sebastien Spratt, Mauro Rodriguez, Tim Colonius, Spencer H. Bryngelson

GPU Implementation and Optimization of a Flexible MAP Decoder for Synchronization Correction

In this paper we present an optimized parallel implementation of a flexible MAP decoder for synchronization error correcting codes, supporting a very wide range of code sizes and channel conditions. On mid-range GPUs we demonstrate decoding…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Johann A. Briffa

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-11 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI

We assess the performance of the hybrid Open Accelerator (OpenACC) and Message Passing Interface (MPI) approach for multi-graphics processing units (GPUs) accelerated thermal lattice Boltzmann (LB) simulation. The OpenACC accelerates…

Fluid Dynamics · Physics 2022-11-21 Ao Xu, Bo-Tao Li

Measuring and comparing the scaling behaviour of a high-performance CFD code on different supercomputing infrastructures

Parallel code design is a challenging task especially when addressing petascale systems for massive parallel processing (MPP), i.e. parallel computations on several hundreds of thousands of cores. An in-house computational fluid dynamics…

Performance · Computer Science 2018-07-03 Jérôme Frisch, Ralf-Peter Mundani

CMD: A Cache-assisted GPU Memory Deduplication Architecture

Massive off-chip accesses in GPUs are the main performance bottleneck, and we divided these accesses into three types: (1) Write, (2) Data-Read, and (3) Read-Only. Besides, we find that many writes are duplicates, and the duplication can be…

Hardware Architecture · Computer Science 2024-08-20 Wei Zhao, Dan Feng, Wei Tong, Xueliang Wei, Bing Wu

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-22 Basilis Mamalis, Marios Perlitis