Related papers: Toward the Graphics Turing Scale on a Blue Gene Su…
The architecture of the BlueGene/L massively parallel supercomputer is described. Each computing node consists of a single compute ASIC plus 256 MB of external memory. The compute ASIC integrates two 700 MHz PowerPC 440 integer CPU cores,…
Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the…
Genetic Programming (GP) is a computationally intensive technique which also has a high degree of natural parallelism. Parallel computing architectures have become commonplace especially with regards Graphics Processing Units (GPU). Hence,…
In this paper we present scaling results of a FFT library, FFTK, and a pseudospectral code, Tarang, on grid resolutions up to $8192^3$ grid using 65536 cores of Blue Gene/P and 196608 cores of Cray XC40 supercomputers. We observe that…
General Purpose computing on Graphical Processing Units (GPGPU) has resulted in unprecedented levels of speedup over its CPU counterparts, allowing programmers to harness the computational power of GPU shader cores to accelerate other…
Data management on GPUs has become increasingly relevant due to a tremendous rise in processing power and available GPU memory. Similar to main-memory systems, there is a need for performant GPU-resident index structures to speed up query…
Our work addresses the enabling of the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications, on large-scale systems. This approach allows new-and potentially…
Ray tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of high-quality global illumination at a heavy computational cost. Because of the high computation…
We study ray reordering as a tool for increasing the performance of existing GPU ray tracing implementations. We focus on ray reordering that is fully agnostic to the particular trace kernel. We summarize the existing methods for computing…
We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying…
Genetic Programming (GP) is a computationally intensive technique which is naturally parallel in nature. Consequently, many attempts have been made to improve its run-time from exploiting highly parallel hardware such as GPUs. However, a…
Beagle is a new software framework that enables execution of Genetic Programming tasks on the GPU. Currently available for symbolic regression, it processes individuals of the population and fitness cases for training in a way that…
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…
Training deep neural networks on large scientific data is a challenging task that requires enormous compute power, especially if no pre-trained models exist to initialize the process. We present a novel tournament method to train…
We present a high-performance, graphics processing unit (GPU)-based framework for the efficient analysis and visualization of (nearly) terabyte (TB)-sized 3-dimensional images. Using a cluster of 96 GPUs, we demonstrate for a 0.5 TB image:…
Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel non-parametric, self-tunable,…
The vision of super computer at every desk can be realized by powerful and highly parallel CPUs or GPUs or APUs. Graphics processors once specialized for the graphics applications only, are now used for the highly computational intensive…
Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent…
We present a new numerical scheme to solve the transfer of diffuse radiation on three-dimensional mesh grids which is efficient on processors with highly parallel architecture such as recently popular GPUs and CPUs with multi- and many-core…
Gravitational lensing calculation using a direct inverse ray-shooting approach is a computationally expensive way to determine magnification maps, caustic patterns, and light-curves (e.g. as a function of source profile and size). However,…