Related papers: Automatic Horizontal Fusion for GPU Kernels
Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and…
Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…
Kernels are executable code segments and kernel fusion is a technique for combing the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image processing kernels to be executed…
Novel methods are presented in this initial study for the fusion of GPU kernels in the artificial compressibility method (ACM), using tensor product elements with constant Jacobians and flux reconstruction. This is made possible through the…
A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning…
Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve…
Current AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to…
The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being…
The Convex Hull algorithm is one of the most important algorithms in computational geometry, with many applications such as in computer graphics, robotics, and data mining. Despite the advances in the new algorithms in this area, it is…
We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or…
We propose a new hybrid topology optimization algorithm based on multigrid approach that combines the parallelization strategy of CPU using OpenMP and heavily multithreading capabilities of modern Graphics Processing Units (GPU). In…
Domain-specific languages that execute image processing pipelineson GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have…
We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…
The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in research fields of condensed matter physics and chemistry. The algorithm has a difficulty to be parallelized on a…
Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the…
Heterogeneous graph neural networks (HGNNs) are essential for capturing the structure and semantic information in heterogeneous graphs. However, existing GPU-based solutions, such as PyTorch Geometric, suffer from low GPU utilization due to…
The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of the training data in CNNs, GPUs have been used for the acceleration, and memory-efficient algorithms are focused because of thier…
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism…
Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies…
In atomistic spin dynamics simulations, the time cost of constructing the space- and time-displaced pair correlation function in real space increases quadratically as the number of spins $N$, leading to significant computational effort. The…