Related papers: Code Optimization on Kepler GPUs and Xeon Phi

Performance of Kepler GTX Titan GPUs and Xeon Phi System

NVIDIA's new architecture, Kepler, improves GPU performance significantly with the new streaming multiprocessor, SMX. Along with the performance gains, NVIDIA has also introduced many new technologies such as dynamic parallelism, Hyper-Q and GPU…

Computational Physics · Physics 2013-11-05 Hwancheol Jeong , Weonjong Lee , Jeonghwan Pak , Kwang-jong Choi , Sang-Hyun Park , Jun-sik Yoo , Joo Hwan Kim , Joungjin Lee , Young Woo Lee

Performance of GTX Titan X GPUs and Code Optimization

Recently, NVIDIA released a new GPU model, the GTX Titan X (TX), in the lineage of the Maxwell architecture. We use our conjugate gradient code and non-perturbative renormalization code to measure the performance of TX. The results are compared…

High Energy Physics - Lattice · Physics 2015-11-03 Hwancheol Jeong , Sangbaek Lee , Weonjong Lee , Jeonghwan Pak , Jangho Kim , Juhyun Chung

Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs

Lattice Quantum Chromodynamics simulations typically spend most of the runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel Xeon…

Computational Physics · Physics 2014-11-18 O. Kaczmarek , C. Schmidt , P. Steinbrecher , M. Wagner

HISQ inverter on Intel Xeon Phi and NVIDIA GPUs

The runtime of a Lattice QCD simulation is dominated by a small kernel, which calculates the product of a vector by a sparse matrix known as the "Dslash" operator. Therefore, this kernel is frequently optimized for various HPC…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-09-05 O. Kaczmarek , C. Schmidt , P. Steinbrecher , Swagato Mukherjee , M. Wagner

Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes

Optimizing the geometry of airfoils for a specific application is an important engineering problem. In this context, genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-14 Lukas Einkemmer

Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

Recently Graphics Processing Units (GPUs) have been used to speed up very CPU-intensive gravitational microlensing simulations. In this work, we use the Xeon Phi coprocessor to accelerate such simulations and compare its performance on a…

Instrumentation and Methods for Astrophysics · Physics 2017-03-30 Bin Chen , Ronald Kantowski , Xinyu Dai , Eddie Baron , Paul Van der Mark

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to…

High Energy Physics - Lattice · Physics 2017-12-04 Ruizi Li , Carleton DeTar , Steven Gottlieb , Doug Toussaint

Finite temperature lattice QCD with GPUs

Graphics Processing Units (GPUs) are being used in many areas of physics, since the performance versus cost is very attractive. The GPUs can be addressed by CUDA, which is NVIDIA's parallel computing architecture. It enables dramatic…

High Energy Physics - Lattice · Physics 2012-10-12 Nuno Cardoso , Marco Cardoso , Pedro Bicudo

Accelerator Codesign as Non-Linear Optimization

We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple,…

Hardware Architecture · Computer Science 2017-12-26 Nirmal Prajapati , Sanjay Rajopadhye , Hristo Djidjev , Nandkishore Santhi , Tobias Grosser , Rumen Andonov

Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization

The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due to both significant progress in accelerating the iterative linear solvers using…

High Energy Physics - Lattice · Physics 2016-12-26 M. A. Clark , Bálint Joó , Alexei Strelchenko , Michael Cheng , Arjun Gambhir , Richard Brower

Optimization of Tensor-product Operations in Nekbone on GPUs

In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-28 Martin Karp , Niclas Jansson , Artur Podobas , Philipp Schlatter , Stefano Markidis

Up to 700k GPU cores, Kepler, and the Exascale future for simulations of star clusters around black holes

We present direct astrophysical N-body simulations with up to a few million bodies using our parallel MPI/CUDA code on large GPU clusters in China, Ukraine and Germany, with different kinds of GPU hardware. These clusters are directly…

Instrumentation and Methods for Astrophysics · Physics 2013-12-09 P. Berczik , R. Spurzem , L. Wang , S. Zhong , O. Veles , I. Zinchenko , S. Huang , M. Tsai , G. Kennedy , S. Li , L. Naso , C. Li

Performance Evaluation and Acceleration of the QTensor Quantum Circuit Simulator on GPUs

This work studies the porting and optimization of the tensor network simulator QTensor on GPUs, with the ultimate goal of simulating quantum circuits efficiently at scale on large GPU supercomputers. We implement NumPy, PyTorch, and CuPy…

Quantum Physics · Physics 2022-04-14 Danylo Lykov , Angela Chen , Huaxuan Chen , Kristopher Keipert , Zheng Zhang , Tom Gibbs , Yuri Alexeev

Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core - MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-15 George Teodoro , Tahsin Kurc , Guilherme Andrade , Jun Kong , Renato Ferreira , Joel Saltz

Gravitational octree code performance evaluation on Volta GPU

In this study, the gravitational octree code originally optimized for the Fermi, Kepler, and Maxwell GPU architectures is adapted to the Volta architecture. The Volta architecture introduces independent thread scheduling requiring either…

Mathematical Software · Computer Science 2018-11-08 Yohei Miki

Exact diagonalization of quantum lattice models on coprocessors

We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics…

Strongly Correlated Electrons · Physics 2016-09-21 Topi Siro , Ari Harju

Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal

Today, one of the main challenges for high-performance computing systems is to improve their performance by keeping energy consumption at acceptable levels. In this context, a consolidated strategy consists of using accelerators such as…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-18 Manuel Costanzo , Enzo Rucci , Ulises Costi , Franco Chichizola , Marcelo Naiouf

Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision…

High Energy Physics - Lattice · Physics 2010-12-06 Ronald Babich , Michael A. Clark , Bálint Joó

Fast quantum Monte Carlo on a GPU

We present a scheme for the parallelization of quantum Monte Carlo on graphical processing units, focusing on bosonic systems and variational Monte Carlo. We use asynchronous execution schemes with shared memory persistence, and obtain an…

Computational Physics · Physics 2014-12-10 Y. Lutsyshyn

GPU Computing with Python: Performance, Energy Efficiency and Usability

In this work, we examine the performance, energy efficiency and usability when using Python for developing HPC codes running on the GPU. We investigate the portability of performance and energy efficiency between CUDA and OpenCL; between…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-03-11 Håvard H. Holm , André R. Brodtkorb , Martin L. Sætra