Related papers: Automatic Horizontal Fusion for GPU Kernels

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-09 Oscar Amoros, Albert Andaluz, Johnny Nunez, Antonio J. Pena

Optimizing CUDA Code By Kernel Fusion---Application on BLAS

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs

Kernels are executable code segments and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image processing kernels to be executed…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-09-16 Asif M Adnan, Sridhar Radhakrishnan, Suleyman Karabuk

Hyperbolic Diffusion in Flux Reconstruction: Optimisation through Kernel Fusion within Tensor-Product Elements

Novel methods are presented in this initial study for the fusion of GPU kernels in the artificial compressibility method (ACM), using tensor product elements with constant Jacobians and flux reconstruction. This is made possible through the…

Mathematical Software · Computer Science 2022-01-05 Will Trojak, Rob Watson, Freddie Witherden

Automatic Kernel Generation for Volta Tensor Cores

A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning…

Programming Languages · Computer Science 2020-08-04 Somashekaracharya G. Bhaskaracharya, Julien Demouth, Vinod Grover

Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion

Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve…

Computer Vision and Pattern Recognition · Computer Science 2026-02-25 Ran Zhang, Xuanhua He, Liu Liu

Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration

Current AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-15 Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva

Accelerating the Convex Hull Computation with a Parallel GPU Algorithm

The convex hull is a fundamental geometrical structure for many applications where groups of points must be enclosed or represented by a convex polygon. Although efficient sequential convex hull algorithms exist, and are constantly being…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-27 Alan Keith, Héctor Ferrada, Cristóbal A. Navarro

An Evaluation of GPU Filters for Accelerating the 2D Convex Hull

The Convex Hull algorithm is one of the most important algorithms in computational geometry, with many applications such as in computer graphics, robotics, and data mining. Despite the advances in the new algorithms in this area, it is…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-21 Roberto Carrasco, Héctor Ferrada, Cristóbal A. Navarro, Nancy Hitschfeld

Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU

We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or…

Programming Languages · Computer Science 2023-08-29 Luke Anderson, Andrew Adams, Karima Ma, Tzu-Mao Li, Tian Jin, Jonathan Ragan-Kelley

Efficient hybrid topology optimization using GPU and homogenization based multigrid approach

We propose a new hybrid topology optimization algorithm based on multigrid approach that combines the parallelization strategy of CPU using OpenMP and heavily multithreading capabilities of modern Graphics Processing Units (GPU). In…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-01 Arya Prakash Padhi, Souvik Chakraborty, Anupam Chakrabarti, Rajib Chowdhury

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have…

Programming Languages · Computer Science 2020-09-09 Abhinav Jangda, Arjun Guha

High-Performance Code Generation though Fusion and Vectorization

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…

Performance · Computer Science 2017-10-25 Jason Sewall, Simon J. Pennycook

Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units

The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in research fields of condensed matter physics and chemistry. The algorithm has a difficulty to be parallelized on a…

Computational Physics · Physics 2011-05-30 Shixun Zhang, Shinichi Yamagiwa, Masahiko Okumura, Seiji Yunoki

MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators

Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-30 Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng

Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels

Heterogeneous graph neural networks (HGNNs) are essential for capturing the structure and semantic information in heterogeneous graphs. However, existing GPU-based solutions, such as PyTorch Geometric, suffer from low GPU utilization due to…

Hardware Architecture · Computer Science 2024-08-19 Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, Dongrui Fan

Fast convolution kernels on pascal GPU with high memory efficiency

The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of the training data in CNNs, GPUs have been used for the acceleration, and memory-efficient algorithms are focused because of their…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-02 Qiong Chang, Masaki Onishi, Tsutomu Maruyama

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-30 Xueying Wang, Guangli Li, Xiao Dong, Jiansong Li, Lei Liu, Xiaobing Feng

Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core

In atomistic spin dynamics simulations, the time cost of constructing the space- and time-displaced pair correlation function in real space increases quadratically as the number of spins $N$, leading to significant computational effort. The…

Computational Physics · Physics 2023-08-16 Hongwei Chen, Shiyang Chen, Joshua J. Turner, Adrian Feiguin