Related papers: Generating GPU Compiler Heuristics using Reinforce…

Compilation Forking: A Fast and Flexible Way of Generating Data for Compiler-Internal Machine Learning Tasks

Compiler optimization decisions are often based on hand-crafted heuristics centered around a few established benchmark suites. Alternatively, they can be learned from feature and performance data produced during compilation. However,…

Programming Languages · Computer Science 2022-06-29 Raphael Mosaner, David Leopoldseder, Wolfgang Kisling, Lukas Stadler, Hanspeter Mössenböck

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-05 Richard Schoonhoven, Ben van Werkhoven, Kees Joost Batenburg

Autotuning GPU Kernels via Static and Predictive Analysis

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-30 Robert V. Lim, Boyana Norris, Allen D. Malony

Can We Learn Heuristics For Graphical Model Inference Using Reinforcement Learning?

Combinatorial optimization is frequently used in computer vision. For instance, in applications like semantic segmentation, human pose estimation and action recognition, programs are formulated for solving inference in Conditional Random…

Computer Vision and Pattern Recognition · Computer Science 2020-05-06 Safa Messaoud, Maghav Kumar, Alexander G. Schwing

Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs

Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping…

Machine Learning · Computer Science 2026-01-30 Xiao Zhang, Yaoyao Ding, Bolin Sun, Yang Hu, Tatiana Shpeisman, Gennady Pekhimenko

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than today's systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri, Apan Qasem, Martin Burtscher

Machine Learning-driven Autotuning of Graphics Processing Unit Accelerated Computational Fluid Dynamics for Enhanced Performance

Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to…

Performance · Computer Science 2024-02-21 Weicheng Xue, Christopher John Roy

Theoretical Foundations of GPU-Native Compilation for Rapid Code Iteration

Current AI code generation systems suffer from significant latency bottlenecks due to CPU-GPU data transfers during compilation, execution, and testing phases. We establish theoretical foundations for three complementary approaches to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-15 Adilet Metinov, Gulida M. Kudakeeva, Gulnara D. Kabaeva

Compilation Techniques for Graph Algorithms on GPUs

The performance of graph programs depends highly on the algorithm, the size and structure of the input graphs, as well as the features of the underlying hardware. No single set of optimizations or one hardware platform works well across all…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-11 Ajay Brahmakshatriya, Yunming Zhang, Changwan Hong, Shoaib Kamil, Julian Shun, Saman Amarasinghe

A Learned Performance Model for Tensor Processing Units

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for…

Performance · Computer Science 2021-03-19 Samuel J. Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, Mike Burrows

Compiler Auto-tuning through Multiple Phase Learning

Widely used compilers like GCC and LLVM usually have hundreds of optimizations controlled by optimization flags, which are enabled or disabled during compilation to improve runtime performance (e.g., small execution time) of the compiler…

Programming Languages · Computer Science 2023-05-01 Mingxuan Zhu, Dan Hao, Junjie Chen

Using hardware performance counters to speed up autotuning convergence on GPUs

Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-20 Jiří Filipovič, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

There has been an increased interest in discovering heuristics for combinatorial problems on graphs through machine learning. While existing techniques have primarily focused on obtaining high-quality solutions, scalability to billion-sized…

Machine Learning · Computer Science 2020-12-04 Sahil Manchanda, Akash Mittal, Anuj Dhawan, Sourav Medya, Sayan Ranu, Ambuj Singh

GPU accelerated program synthesis: Enumerate semantics, not syntax!

Program synthesis is an umbrella term for generating programs and logical formulae from specifications. With the remarkable performance improvements that GPUs enable for deep learning, a natural question arose: can we also implement a…

Programming Languages · Computer Science 2025-04-29 Martin Berger, Nathanaël Fijalkow, Mojtaba Valizadeh

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state-of-the-art in high-performance deep learning today is primarily driven by manually optimized…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-31 Navdeep Katel, Vivek Khandelwal, Uday Bondhugula

Reinforced Genetic Algorithm Learning for Optimizing Computation Graphs

We present a deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. Unlike earlier learning-based works that require training the optimizer on the same graph to…

Machine Learning · Computer Science 2020-02-11 Aditya Paliwal, Felix Gimeno, Vinod Nair, Yujia Li, Miles Lubin, Pushmeet Kohli, Oriol Vinyals

Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning

Compiler auto-tuning optimizes pass sequences to improve performance metrics such as Intermediate Representation (IR) instruction count. Although recent advances leveraging Large Language Models (LLMs) have shown promise in automating…

Machine Learning · Computer Science 2025-06-23 Haolin Pan, Hongyu Lin, Haoran Luo, Yang Liu, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu

On Designing GPU Algorithms with Applications to Mesh Refinement

We present a set of rules to guide the design of GPU algorithms. These rules are grounded on the principle of reducing waste in GPU utility to achieve good speed up. In accordance with these rules, we propose GPU algorithms for 2D…

Graphics · Computer Science 2020-07-02 Zhenghai Chen, Tiow-Seng Tan, Hong-Yang Ong

Learning Branch Probabilities in Compiler from Datacenter Workloads

Estimating the probability with which a conditional branch instruction is taken is an important analysis that enables many optimizations in modern compilers. When using Profile Guided Optimizations (PGO), compilers are able to make a good…

Machine Learning · Computer Science 2022-02-17 Easwaran Raman, Xinliang David Li

Learning Improvement Heuristics for Solving Routing Problems

Recent studies in using deep learning to solve routing problems focus on construction heuristics, the solutions of which are still far from optimality. Improvement heuristics have great potential to narrow this gap by iteratively refining a…

Artificial Intelligence · Computer Science 2020-05-12 Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang, Andrew Lim