Related papers: Accelerating Pythonic coupled cluster implementati…

Efficient Coupled-Cluster Python Frameworks for Next-Generation GPUs: A Comparative Study of CuPy and PyTorch on the Hopper and Grace Hopper Architecture

In this work, we introduce new batching algorithms to effectively handle large contractions encountered in coupled-cluster singles and doubles (CCSD) implementations in Python on the Video Random Access Memory (VRAM) of graphical processing…

Chemical Physics · Physics 2026-03-24 Antonina Dobrowolska, Julian Świerczyński, Paweł Tecmer, Emil Sujkowski, Somayeh Ahmadkhani, Grzegorz Mazur, Klemens Noga, Jeff Hammond, Katharina Boguslawski

Parallel Sub-Structuring Methods for solving Sparse Linear Systems on a cluster of GPU

The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…

Numerical Analysis · Mathematics 2021-08-31 Abal-Kassim Cheik Ahamed, Frédéric Magoulès

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in…

Hardware Architecture · Computer Science 2016-02-04 Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Saugata Ghose, Abhishek Bhowmick, Rachata Ausavarangnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu

WgPy: GPU-accelerated NumPy-like array library for web browsers

To execute scientific computing programs such as deep learning at high speed, GPU acceleration is a powerful option. With the recent advancements in web technologies, interfaces like WebGL and WebGPU, which utilize GPUs on the client side…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Masatoshi Hidaka, Tatsuya Harada

GPU-Resident Gaussian Process Regression Leveraging Asynchronous Tasks with HPX

Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library by incorporating a fully GPU-resident GP prediction…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-24 Henrik Möllmann, Dirk Pflüger, Alexander Strack

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Ayesha Afzal, Georg Hager, Stefano Markidis, Gerhard Wellein

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Multiple-GPU accelerated high-order gas-kinetic scheme on three-dimensional unstructured meshes

Recently, successes have been achieved for the high-order gas-kinetic schemes (HGKS) on unstructured meshes for compressible flows. In this paper, to accelerate the computation, HGKS is implemented with the graphical processing unit (GPU)…

Numerical Analysis · Mathematics 2024-07-02 Yuhang Wang, Waixiang Cao, Liang Pan

A Python-based flow solver for numerical simulations using an immersed boundary method on single GPUs

We present an efficient implementation for running three-dimensional numerical simulations of fluid-structure interaction problems on single GPUs, based on Nvidia CUDA through Numba and Python. The incompressible flow around moving bodies…

Fluid Dynamics · Physics 2024-12-05 M. Guerrero-Hurtado, J. M. Catalán, M. Moriche, A. Gonzalo, O. Flores

Developing a High Performance Software Library with MPI and CUDA for Matrix Computations

Nowadays, the paradigm of parallel computing is changing. CUDA is now a popular programming model for general purpose computations on GPUs and a great number of applications were ported to CUDA obtaining speedups of orders of magnitude…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-09 Bogdan Oancea, Tudorel Andrei

GPU Acceleration of Monte Carlo Tallies on Unstructured Meshes in OpenMC with PUMI-Tally

Unstructured mesh tallies are a bottleneck in Monte Carlo neutral particle transport simulations of fusion reactors. This paper introduces the PUMI-Tally library that takes advantage of mesh adjacency information to accelerate these tallies…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-29 Fuad Hasan, Cameron W. Smith, Mark S. Shephard, R. Michael Churchill, George J. Wilkie, Paul K. Romano, Patrick C. Shriwise, Jacob S. Merson

PyClustrPath: An efficient Python package for generating clustering paths with GPU acceleration

Convex clustering is a popular clustering model without requiring the number of clusters as prior knowledge. It can generate a clustering path by continuously solving the model with a sequence of regularization parameter values. This paper…

Optimization and Control · Mathematics 2025-01-28 Hongfei Wu, Yancheng Yuan

Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics

High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-07 R. Borrell, D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vazquez, G. Oyarzun

GPU Computing with Python: Performance, Energy Efficiency and Usability

In this work, we examine the performance, energy efficiency and usability when using Python for developing HPC codes running on the GPU. We investigate the portability of performance and energy efficiency between CUDA and OpenCL; between…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-11 Håvard H. Holm, André R. Brodtkorb, Martin L. Sætra

CUDAEASY - a GPU Accelerated Cosmological Lattice Program

This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute…

Instrumentation and Methods for Astrophysics · Physics 2014-11-20 Jani Sainio

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-09 Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

Generating coupled cluster code for modern distributed memory tensor software

Using GPU-based HPC platforms efficiently for coupled cluster computations is a challenge due to heterogeneous hardware structures. The constant need to adapt software to these structures and the required man-hours makes a systematization…

Chemical Physics · Physics 2025-10-07 Jan Brandejs, Johann Pototschnig, Trond Saue

Acceleration of a QM/MM-QMC simulation using GPU

We accelerated an ab-initio molecular QMC calculation by using GPGPU. Only the bottle-neck part of the calculation is replaced by CUDA subroutine and performed on GPU. The performance on a (single core CPU + GPU) is compared with that on a…

Computational Physics · Physics 2012-04-06 Yutaka Uejima, Tomoharu Terashima, Ryo Maezono

Accelerating the Convex Hull Computation with a Parallel GPU Algorithm

The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Alan Keith, Héctor Ferrada, Cristóbal A. Navarro

A Python Framework for Fast Modelling and Simulation of Cellular Nonlinear Networks and other Finite-difference Time-domain Systems

This paper introduces and evaluates a freely available cellular nonlinear network simulator optimized for the effective use of GPUs, to achieve fast modelling and simulations. Its relevance is demonstrated for several applications in…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-02-23 Radu Dogaru, Ioana Dogaru