Related papers: Autotuning GPU Kernels via Static and Predictive A…

A Benchmark Set of Highly-efficient CUDA and OpenCL Kernels and its Dynamic Autotuning with Kernel Tuning Toolkit

Autotuning of performance-relevant source-code parameters makes it possible to tune applications automatically without hard-coding optimizations, and thus helps keep performance portable. In this paper, we introduce a benchmark set of ten…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-03-02 · Filip Petrovič, David Střelák, Jana Hozzová, Jaroslav Oľha, Richard Trembecký, Siegfried Benkner, Jiří Filipovič

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-10-05 · Richard Schoonhoven, Ben van Werkhoven, Kees Joost Batenburg

Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures

We have developed several autotuning benchmarks in CUDA that take into account performance-relevant source-code parameters and reach near-peak performance on various GPU architectures. We have used them during the development and evaluation…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-02-11 · Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič

Towards a Benchmarking Suite for Kernel Tuners

As computing systems become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-03-17 · Jacob O. Tørring, Ben van Werkhoven, Filip Petrovič, Floris-Jan Willemsen, Jiří Filipovič, Anne C. Elster

Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning

Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-15 · Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, Kees Joost Batenburg

Comprehensive Optimization of Parametric Kernels for Graphics Processing Units

This work deals with the optimization of computer programs targeting Graphics Processing Units (GPUs). The goal is to lift, from programmers to optimizing compilers, the heavy burden of determining program details that are dependent on the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-01-16 · Xiaohui Chen, Marc Moreno-Maza, Jeeva Paudel, Ning Xie

Analytical Performance Estimation during Code Generation on Modern GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-08-08 · Dominik Ernst, Markus Holzer, Georg Hager, Matthias Knorr, Gerhard Wellein

Using hardware performance counters to speed up autotuning convergence on GPUs

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-09-20 · Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič

Pushing the Limits of Online Auto-tuning: Machine Code Optimization in Short-Running Kernels

We propose an online auto-tuning approach for computing kernels. Unlike existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists of deploying…

Performance · Computer Science · 2017-07-17 · Fernando Endo, Damien Couroussé, Henri-Pierre Charles

Bayesian Optimization for auto-tuning GPU kernels

Finding optimal parameter configurations for tunable GPU kernels is a non-trivial exercise for large search spaces, even when automated. This poses an optimization task on a non-convex search space, using an expensive-to-evaluate function…

Machine Learning · Computer Science · 2021-12-01 · Floris-Jan Willemsen, Rob van Nieuwpoort, Ben van Werkhoven

Machine Learning-driven Autotuning of Graphics Processing Unit Accelerated Computational Fluid Dynamics for Enhanced Performance

Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to…

Performance · Computer Science · 2024-02-21 · Weicheng Xue, Christopher John Roy

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-09-18 · Kazuaki Matsumura, Simon Garcia De Gonzalo, Antonio J. Peña

Analyzing Search Techniques for Autotuning Image-based GPU Kernels: The Impact of Sample Sizes

Modern computing systems are increasingly complex, with their multicore CPUs and GPU accelerators changing yearly, if not more often. It has thus become very challenging to write programs that efficiently use the associated complex…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-03-28 · Jacob O. Tørring, Anne C. Elster

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…

Computation and Language · Computer Science · 2026-01-26 · Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-18 · Saeed Taheri, Apan Qasem, Martin Burtscher

SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation

Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works…

Hardware Architecture · Computer Science · 2024-03-26 · Guoliang He, Eiko Yoneki

Optimal Kernel Tuning Parameter Prediction using Deep Sequence Models

GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions, of times in a typical…

Machine Learning · Computer Science · 2024-04-18 · Khawir Mahmood, Jehandad Khan, Hammad Afzal

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-03-23 · Stijn Heldens, Ben van Werkhoven

Accelerating Distributed-Memory Autotuning via Statistical Analysis of Execution Paths

The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-03-03 · Edward Hutter, Edgar Solomonik

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-17 · Milo Lurati, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven