Related papers: Autotuning GPU Kernels via Static and Predictive A…
Autotuning of performance-relevant source-code parameters allows to automatically tune applications without hard coding optimizations and thus helps with keeping the performance portable. In this paper, we introduce a benchmark set of ten…
Recent years have witnessed phenomenal growth in the application, and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…
We have developed several autotuning benchmarks in CUDA that take into account performance-relevant source-code parameters and reach near peak-performance on various GPU architectures. We have used them during the development and evaluation…
As computing system become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as…
Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…
This work deals with the optimization of computer programs targeting Graphics Processing Units (GPUs). The goal is to lift, from programmers to optimizing compilers, the heavy burden of determining program details that are dependent on the…
Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…
Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of…
We propose an online auto-tuning approach for computing kernels. Differently from existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists on deploying…
Finding optimal parameter configurations for tunable GPU kernels is a non-trivial exercise for large search spaces, even when automated. This poses an optimization task on a non-convex search space, using an expensive to evaluate function…
Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to…
Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they…
Modern computing systems are increasingly more complex, with their multicore CPUs and GPUs accelerators changing yearly, if not more often. It thus has become very challenging to write programs that efficiently use the associated complex…
GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving…
Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than todays systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…
Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works…
GPU kernels have come to the forefront of computing due to their utility in varied fields, from high-performance computing to machine learning. A typical GPU compute kernel is invoked millions, if not billions of times in a typical…
Graphic Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space,…
The prohibitive expense of automatic performance tuning at scale has largely limited the use of autotuning to libraries for shared-memory and GPU architectures. We introduce a framework for approximate autotuning that achieves a desired…
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to…