Related papers: Accelerating Pythonic coupled cluster implementati…
In this work, we introduce new batching algorithms to effectively handle large contractions encountered in coupled-cluster singles and doubles (CCSD) implementations in Python on the Video Random Access Memory (VRAM) of graphical processing…
The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in…
To execute scientific computing programs such as deep learning at high speed, GPU acceleration is a powerful option. With the recent advancements in web technologies, interfaces like WebGL and WebGPU, which utilize GPUs on the client side…
Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library by incorporating a fully GPU-resident GP prediction…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…
Recently, successes have been achieved for the high-order gas-kinetic schemes (HGKS) on unstructured meshes for compressible flows. In this paper, to accelerate the computation, HGKS is implemented with the graphical processing unit (GPU)…
We present an efficient implementation for running three-dimensional numerical simulations of fluid-structure interaction problems on single GPUs, based on Nvidia CUDA through Numba and Python. The incompressible flow around moving bodies…
Nowadays, the paradigm of parallel computing is changing. CUDA is now a popular programming model for general purpose computations on GPUs and a great number of applications were ported to CUDA obtaining speedups of orders of magnitude…
Unstructured mesh tallies are a bottleneck in Monte Carlo neutral particle transport simulations of fusion reactors. This paper introduces the PUMI-Tally library that takes advantage of mesh adjacency information to accelerate these tallies…
Convex clustering is a popular clustering model without requiring the number of clusters as prior knowledge. It can generate a clustering path by continuously solving the model with a sequence of regularization parameter values. This paper…
High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required…
In this work, we examine the performance, energy efficiency and usability when using Python for developing HPC codes running on the GPU. We investigate the portability of performance and energy efficiency between CUDA and OpenCL; between…
This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute…
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for…
Using GPU-based HPC platforms efficiently for coupled cluster computations is a challenge due to heterogeneous hardware structures. The constant need to adapt software to these structures and the required man-hours makes a systematization…
We accelerated an ab-initio molecular QMC calculation by using GPGPU. Only the bottle-neck part of the calculation is replaced by CUDA subroutine and performed on GPU. The performance on a (single core CPU + GPU) is compared with that on a…
The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being…
This paper introduces and evaluates a freely available cellular nonlinear network simulator optimized for the effective use of GPUs, to achieve fast modelling and simulations. Its relevance is demonstrated for several applications in…