Related papers: The Fused Kernel Library: A C++ API to Develop Hig…
Graphic Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space,…
We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose goal is to eliminate intermediate data round trips, our…
Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs.…
Last several years, GPUs are used to accelerate computations in many computer science domains. We focused on GPU accelerated Support Vector Machines (SVM) training with non-linear kernel functions. We had searched for all available GPU…
We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.…
Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data…
The introduction of Intel(R) Xeon Phi(TM) coprocessors opened up new possibilities in development of highly parallel applications. The familiarity and flexibility of the architecture together with compiler support integrated into the Intel…
General Matrix Multiplication or GEMM kernels take centre place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA's Tensor Cores. Their exploitation is hampered by the…
In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient…
In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…
In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations…
We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We…
The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing…
Tremendous advances in parallel computing and graphics hardware opened up several novel real-time GPU applications in the fields of computer vision, computer graphics as well as augmented reality (AR) and virtual reality (VR). Although…
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the…
FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing, digital image processing and so on. However, in application, FFT becomes a factor of affecting the processing efficiency, especially…
We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific…
We describe the Bandicoot GPU linear algebra toolkit, a C++ based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, enabling easy…
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively…
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive…