Related papers: Autotuning Apache TVM-based Scientific Application…

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new…

Machine Learning · Computer Science 2018-10-09 Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Meghan Cowan , Haichen Shen , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the…

Mathematical Software · Computer Science 2020-07-28 Zijing Gu

ATiM: Autotuning Tensor Programs for Processing-in-DRAM

Processing-in-DRAM (DRAM-PIM) has emerged as a promising technology for accelerating memory-intensive operations in modern applications, such as Large Language Models (LLMs). Despite its potential, current software stacks for DRAM-PIM face…

Hardware Architecture · Computer Science 2025-06-03 Yongwon Shin , Dookyung Kang , Hyojin Sung

Agile Autotuning of a Transprecision Tensor Accelerator Overlay for TVM Compiler Stack

Specialized accelerators for tensor-operations, such as blocked-matrix operations and multi-dimensional convolutions, have emerged as powerful architecture choices for high-performance Deep-Learning computing. The rapid development of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-24 Dionysios Diamantopoulos , Burkhard Ringlein , Mitra Purandare , Gagandeep Singh , Christoph Hagleitner

Understanding Cache Boundness of ML Operators on ARM Processors

Machine Learning compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations…

Hardware Architecture · Computer Science 2021-02-02 Bernhard Klein , Christoph Gratl , Manfred Mücke , Holger Fröning

Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain…

Computation and Language · Computer Science 2023-11-01 Guillermo Alaejos , Adrián Castelló , Pedro Alonso-Jordá , Francisco D. Igual , Héctor Martínez , Enrique S. Quintana-Ortí

A High-Level Compiler Integration Approach for Deep Learning Accelerators Supporting Abstraction and Optimization

The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML)…

Machine Learning · Computer Science 2025-07-08 Samira Ahmadifarsani , Daniel Mueller-Gritschneder , Ulf Schlichtmann

AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs

In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in many applications like deep learning and has drawn more and more attention. However, conventional implementations are not…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-24 Chendi Li , Haipeng Jia , Hang Cao , Jianyu Yao , Boqian Shi , Chunyang Xiang , Jinbo Sun , Pengqi Lu , Yunquan Zhang

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization

Autotuning is an approach that explores a search space of possible implementations/configurations of a kernel or an application by selecting and evaluating a subset of implementations/configurations on a target platform and/or use models…

Performance · Computer Science 2020-10-19 Xingfu Wu , Michael Kruse , Prasanna Balaprakash , Hal Finkel , Paul Hovland , Valerie Taylor , Mary Hall

Large Language Models for Human-Machine Collaborative Particle Accelerator Tuning through Natural Language

Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies for cutting-edge high-impact applications, such as physics discovery, cancer research and…

Computation and Language · Computer Science 2024-05-16 Jan Kaiser , Annika Eichler , Anne Lauscher

Automated Algorithm Design for Auto-Tuning Optimizers

Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. While auto-tuners traditionally rely on classical approaches…

Machine Learning · Computer Science 2026-04-01 Floris-Jan Willemsen , Niki van Stein , Ben van Werkhoven

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization (extended version)

In this paper, we develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their…

Machine Learning · Computer Science 2021-04-28 Xingfu Wu , Michael Kruse , Prasanna Balaprakash , Hal Finkel , Paul Hovland , Valerie Taylor , Mary Hall

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized…

Machine Learning · Computer Science 2025-04-09 Jacob O. Tørring , Carl Hvarfner , Luigi Nardi , Magnus Själander

Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework

The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the already available low level schedules…

Machine Learning · Computer Science 2024-06-11 F. N. Peccia , O. Bringmann

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Guyue Huang , Yang Bai , Liu Liu , Yuke Wang , Bei Yu , Yufei Ding , Yuan Xie

TAMM: Tensor Algebra for Many-body Methods

Tensor contraction operations in computational chemistry consume significant fractions of computing time on large-scale computing platforms. The widespread use of tensor contractions between large multi-dimensional tensors in describing…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-11 Erdal Mutlu , Ajay Panyala , Nitin Gawande , Abhishek Bagusetty , Jinsung Kim , Karol Kowalski , Nicholas Bauman , Bo Peng , Jiri Brabec , Sriram Krishnamoorthy

An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors

Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high dimensional data, and its flexibility in modeling diverse sources of…

Machine Learning · Computer Science 2024-09-30 Xingfu Wu , Tupendra Oli , Justin H. Qian , Valerie Taylor , Mark C. Hersam , Vinod K. Sangwan

Optimizing Sparse Linear Algebra Through Automatic Format Selection and Machine Learning

Sparse matrices are an integral part of scientific simulations. As hardware evolves, new sparse matrix storage formats are proposed aiming to exploit optimizations specific to the new hardware. In the era of heterogeneous computing, users…

Machine Learning · Computer Science 2023-03-10 Christodoulos Stylianou , Michele Weiland

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with…

Machine Learning · Computer Science 2025-02-19 Ahmed F. AbouElhamayed , Jordan Dotzel , Yash Akhauri , Chi-Chih Chang , Sameh Gobriel , J. Pablo Muñoz , Vui Seng Chua , Nilesh Jain , Mohamed S. Abdelfattah

Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI…

Machine Learning · Computer Science 2025-08-20 Federico Nicolas Peccia , Frederik Haxel , Oliver Bringmann