Related papers: Autotuning Apache TVM-based Scientific Application…
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new…
We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With the…
Processing-in-DRAM (DRAM-PIM) has emerged as a promising technology for accelerating memory-intensive operations in modern applications, such as Large Language Models (LLMs). Despite its potential, current software stacks for DRAM-PIM face…
Specialized accelerators for tensor-operations, such as blocked-matrix operations and multi-dimensional convolutions, have been emerged as powerful architecture choices for high-performance Deep-Learning computing. The rapid development of…
Machine Learning compilers like TVM allow a fast and flexible deployment on embedded CPUs. This enables the use of non-standard operators, which are common in ML compression techniques. However, it is necessary to understand the limitations…
We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS and OpenBLAS, in order to obtain…
The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML)…
In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in many applications like deep learning and has drawn more and more attention. However, conventional implementations are not…
An autotuning is an approach that explores a search space of possible implementations/configurations of a kernel or an application by selecting and evaluating a subset of implementations/configurations on a target platform and/or use models…
Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies cutting-edge high-impact applications, such as physics discovery, cancer research and…
Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. While auto-tuners traditionally rely on classical approaches…
In this paper, we develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their…
Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized…
The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the already available low level schedules…
Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…
Tensor contraction operations in computational chemistry consume significant fractions of computing time on large-scale computing platforms. The widespread use of tensor contractions between large multi-dimensional tensors in describing…
Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high dimensional data, and its flexibility in modeling diverse sources of…
Sparse matrices are an integral part of scientific simulations. As hardware evolves new sparse matrix storage formats are proposed aiming to exploit optimizations specific to the new hardware. In the era of heterogeneous computing, users…
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with…
RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI…