Related papers: The Fused Kernel Library: A C++ API to Develop Hig…

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-23 Stijn Heldens, Ben van Werkhoven

Automatic Horizontal Fusion for GPU Kernels

We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose goal is to eliminate intermediate data round trips, our…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Ao Li, Bojian Zheng, Gennady Pekhimenko, Fan Long

Optimizing CUDA Code By Kernel Fusion---Application on BLAS

Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-13 J. Filipovič, M. Madzin, J. Fousek, L. Matyska

A Comparison of Support Vector Machines Training GPU-Accelerated Open Source Implementations

In the last several years, GPUs have been used to accelerate computations in many computer science domains. We focused on GPU-accelerated Support Vector Machine (SVM) training with non-linear kernel functions. We searched for all available GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-21 Jan Vanek, Josef Michalek, Josef Psutka

High-Performance Code Generation though Fusion and Vectorization

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…

Performance · Computer Science 2017-10-25 Jason Sewall, Simon J. Pennycook

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-16 Daniel Hanlon, Hamidreza Khalighzadeh, Ravi Reddy Manumachu, Alexey Lastovetsky

Efficient Hybrid Execution of C++ Applications using Intel(R) Xeon Phi(TM) Coprocessor

The introduction of Intel(R) Xeon Phi(TM) coprocessors opened up new possibilities in development of highly parallel applications. The familiarity and flexibility of the architecture together with compiler support integrated into the Intel…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-11-26 Jiri Dokulil, Enes Bajrovic, Siegfried Benkner, Sabri Pllana, Martin Sandrieser, Beverly Bachmayer

Flexible Performant GEMM Kernels on GPUs

General Matrix Multiplication or GEMM kernels take centre place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the…

Mathematical Software · Computer Science 2021-11-23 Thomas Faingnaert, Tim Besard, Bjorn De Sutter

Concurrent CPU-GPU Task Programming using Modern C++

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-17 Tsung-Wei Huang, Yibo Lin

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods

In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations…

Computational Engineering, Finance, and Science · Computer Science 2016-11-03 Diego Fabregat-Traver, Davor Davidović, Markus Höhnerbach, Edoardo Di Napoli

A Multi-GPU Programming Library for Real-Time Applications

We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-03-03 Sebastian Schaetz, Martin Uecker

FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection

The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng

stdgpu: Efficient STL-like Data Structures on the GPU

Tremendous advances in parallel computing and graphics hardware opened up several novel real-time GPU applications in the fields of computer vision, computer graphics as well as augmented reality (AR) and virtual reality (VR). Although…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-19 Patrick Stotko

VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-01 Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin

A GPU Based Memory Optimized Parallel Method For FFT Implementation

FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing and digital image processing. However, in practice, FFT becomes a factor affecting processing efficiency, especially…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-25 Fan Zhang, Chen Hu, Qiang Yin, Wei Hu

On the energy efficiency of sparse matrix computations on multi-GPU clusters

We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-16 Massimo Bernaschi, Alessandro Celestini, Pasqua D'Ambra, Giorgio Richelli

Fast GPU Linear Algebra via Compile Time Expression Fusion

We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy…

Mathematical Software · Computer Science 2026-04-27 Ryan R. Curtin, Marcus Edel, Conrad Sanderson

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…

Machine Learning · Computer Science 2025-06-12 Wentao Chen, Jiace Zhu, Qi Fan, Yehan Ma, An Zou

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-20 Chao Chen, Chris Porter, Santosh Pande