Related papers: Lightning: Scaling the GPU Programming Model Beyon…
We present a single-node, multi-GPU programmable graph processing library that allows programmers to easily extend single-GPU graph algorithms to achieve scalable performance on large graphs with billions of edges. Directly using the…
GPUs are now used for a wide range of problems within HPC. However, making efficient use of the computational power available with multiple GPUs is challenging. The main challenges in achieving good performance are memory layout, affecting…
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…
In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…
Choosing an appropriate programming paradigm for high-performance computing on low-power devices can be useful to speed up calculations. Many Android devices have an integrated GPU and - although not officially supported - the OpenCL…
Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE…
We introduce PennyLane's Lightning suite, a collection of high-performance state-vector simulators targeting CPU, GPU, and HPC-native architectures and workloads. Quantum applications such as QAOA, VQE, and synthetic workloads are…
Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…
A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two dimensional Ising model [T. Preis et al., J. Comp.…
The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphic Processing Units (GPUs). The rich programming environment now available on GPUs and flexible hardware capabilities…
The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…
Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large…
GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…
Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…
Due to their highly parallel multi-cores architecture, GPUs are being increasingly used in a wide range of computationally intensive applications. Compared to CPUs, GPUs can achieve higher performances at accelerating the programs'…
For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library.…
GPU clouds have become a popular computing platform because of the cost of owning and maintaining high-performance computing clusters. Many cloud architectures have also been proposed to ensure a secure execution environment for guest…
The development of composable systems architecture marks a significant shift in resource allocation and utilization within data centers. This paper presents a composable architecture scaling up to 32 GPUs on a single node, addressing the…
Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…
In the quest for highest performance in scientific computing, we present a novel framework that relies on high-bandwidth communication between GPUs in a compute cluster. The framework offers linear scaling of performance for explicit…