Related papers: Gravitational octree code performance evaluation o…
We present a new very fast tree-code which runs on massively parallel Graphical Processing Units (GPU) with NVIDIA CUDA architecture. The tree-construction and calculation of multipole moments is carried out on the host CPU, while the force…
We present parallel algorithms for constructing and traversing sparse octrees on graphics processing units (GPUs). The algorithms are based on parallel-scan and sort methods. To test the performance and feasibility, we implemented them in…
The tree method is a widely implemented algorithm for collisionless $N$-body simulations in astrophysics well suited for GPU(s). Adopting hierarchical time stepping can accelerate $N$-body simulations; however, it is infrequently…
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…
This paper presents the benchmarking and scaling studies of a GPU accelerated three dimensional compressible magnetohydrodynamic code. The code is developed keeping an eye to explain the large and intermediate scale magnetic field…
This paper focuses on the parallel implementation of a direct $N$-body method~(particle-particle algorithm) and the application of multiple GPUs for galactic dynamics simulations. Application of a hybrid OpenMP-CUDA technology is considered…
We present a gravitational hierarchical N-body code that is designed to run efficiently on Graphics Processing Units (GPUs). All parts of the algorithm are executed on the GPU which eliminates the need for data transfer between the Central…
Performance optimization can be a daunting task especially as the hardware architecture becomes more and more complex. This paper takes a kernel from the Materials Science code BerkeleyGW, and demonstrates a few performance analysis and…
We compare the performance of two very different parallel gravitational $N$-body codes for astrophysical simulations on large GPU clusters, both pioneer in their own fields as well as in certain mutual scales - NBODY6++ and Bonsai. We carry…
Modeling of collisionless galactic systems is based on the N-body model, which requires large computational resources due to the long-range nature of gravitational forces. The most common method for calculating gravity is the TreeCode…
We present the results of gravitational direct $N$-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the $N$-body problem is implemented…
The kd-tree is a fundamental tool in computer science. Among others, an application of the kd-tree search (oct-tree method) to fast evaluation of particle interactions and neighbor search is highly important since computational complexity…
We present a scheme for the parallelization of quantum Monte Carlo on graphical processing units, focusing on bosonic systems and variational Monte Carlo. We use asynchronous execution schemes with shared memory persistence, and obtain an…
Hardware accelerators (such as Nvidia's CUDA GPUs) have tremendous promise for computational science, because they can deliver large gains in performance at relatively low cost. In this work, we focus on the use of Nvidia's Tesla GPU for…
Kepler GTX Titan Black and Kepler Tesla K40 are still the best GPUs for high performance computing, although Maxwell GPUs such as GTX 980 are available in the market. Hence, we measure the performance of our lattice QCD codes using the…
We present a GPU accelerated CUDA-C implementation of the Barnes Hut (BH) tree code for calculating the gravitational potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh refinement…
I describe here the performances of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture…
We continue our presentation of VINE. We begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the…
The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units…
We review the recent optimizations of gravitational $N$-body kernels for running them on graphics processing units (GPUs), on single hosts and massive parallel platforms. For each of the two main $N$-body techniques, direct summation and…