Related papers: Optimizing CUDA Code By Kernel Fusion---Applicatio…

Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, image processing, etc. Recent successes of convolutional neural networks in various deep learning applications put even…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-31 Xiaoming Chen, Jianxu Chen, Danny Z. Chen, Xiaobo Sharon Hu

Optimizing Data Collection in Deep Reinforcement Learning

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU…

Machine Learning · Computer Science 2022-07-19 James Gleeson, Daniel Snider, Yvonne Yang, Moshe Gabel, Eyal de Lara, Gennady Pekhimenko

Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs

Kernels are executable code segments and kernel fusion is a technique for combining the segments in a coherent manner to improve execution time. For the first time, we have developed a technique to fuse image processing kernels to be executed…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-09-16 Asif M Adnan, Sridhar Radhakrishnan, Suleyman Karabuk

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri, Apan Qasem, Martin Burtscher

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

KBLAS is a new open source high performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of…

Mathematical Software · Computer Science 2014-10-08 Ahmad Abdelfattah, David Keyes, Hatem Ltaief

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-09 Oscar Amoros, Albert Andaluz, Johnny Nunez, Antonio J. Pena

Automatic Kernel Generation for Volta Tensor Cores

A commonly occurring computation idiom in neural networks is to perform some pointwise operations on the result of a matrix multiplication. Such a sequence of operations is typically represented as a computation graph in deep learning…

Programming Languages · Computer Science 2020-08-04 Somashekaracharya G. Bhaskaracharya, Julien Demouth, Vinod Grover

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

Large language models (LLMs) are remarked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs as…

Hardware Architecture · Computer Science 2025-01-15 Guoliang He, Eiko Yoneki

Automatic Horizontal Fusion for GPU Kernels

We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose goal is to eliminate intermediate data round trips, our…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Ao Li, Bojian Zheng, Gennady Pekhimenko, Fan Long

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-20 Chao Chen, Chris Porter, Santosh Pande

Operator Fusion in XLA: Analysis and Evaluation

Machine learning (ML) compilers are an active area of research because they offer the potential to automatically speed up tensor programs. Kernel fusion is often cited as an important optimization performed by ML compilers. However, there…

Machine Learning · Computer Science 2023-01-31 Daniel Snider, Ruofan Liang

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Jonah Ekelund, Stefano Markidis, Ivy Peng

Analyzing Molecular Simulations Trajectories by Utilizing CUDA on GPU Architecture

With the advent of high-performance computing techniques, the data for analysis has grown significantly. Here, graphic processing unit (GPU) based program kernels are discussed to exploit parallelism in the analysis codes specific to…

Computational Physics · Physics 2018-11-07 Gourav Shrivastav, Manish Agarwal

Fast convolution kernels on pascal GPU with high memory efficiency

The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of the training data in CNNs, GPUs have been used for the acceleration, and memory-efficient algorithms are focused because of their…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-02 Qiong Chang, Masaki Onishi, Tsutomu Maruyama

Efficient GPU implementation of randomized SVD and its applications

Matrix decompositions are ubiquitous in machine learning, including applications in dimensionality reduction, data compression and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity which…

Machine Learning · Computer Science 2024-03-13 Łukasz Struski, Paweł Morkisz, Przemysław Spurek, Samuel Rodriguez Bernabeu, Tomasz Trzciński

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

Fast GPGPU Data Rearrangement Kernels using CUDA

Many high-performance computing algorithms are bandwidth limited, hence the need for optimal data rearrangement kernels as well as their easy integration into the rest of the application. In this work, we have built a CUDA library of fast…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-11-17 Michael Bader, Hans-Joachim Bungartz, Dheevatsa Mudigere, Srihari Narasimhan, Babu Narayanan

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the…

Machine Learning · Computer Science 2025-12-15 Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum

How to obtain efficient GPU kernels: an illustration using FMM & FGT algorithms

Computing on graphics processors is maybe one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to…

Mathematical Software · Computer Science 2011-09-21 Felipe A. Cruz, Simon K. Layton, Lorena A. Barba

Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments

Current computational systems are heterogeneous by nature, featuring a combination of CPUs and GPUs. As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-23 Fábio Soldado, Fernando Alexandre, Hervé Paulino