Related papers: Compiler-Level Matrix Multiplication Optimization …

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

Quantization has emerged to be an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory…

Machine Learning · Computer Science 2019-11-15 Wenlei Bao , Li-Wen Chang , Yang Chen , Ke Deng , Amit Agarwal , Emad Barsoum , Abe Taha

A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication

The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-15 Yufan Xia , Marco De La Pierre , Amanda S. Barnard , Giuseppe Maria Junior Barca

A High-Level Compiler Integration Approach for Deep Learning Accelerators Supporting Abstraction and Optimization

The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML)…

Machine Learning · Computer Science 2025-07-08 Samira Ahmadifarsani , Daniel Mueller-Gritschneder , Ulf Schlichtmann

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-03 Shixun Wu , Yujia Zhai , Jinyang Liu , Jiajun Huang , Zizhe Jian , Bryan M. Wong , Zizhong Chen

Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM,…

Performance · Computer Science 2025-11-25 Alfredo Metere

Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems

The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…

Mathematical Software · Computer Science 2015-05-30 Davide Anastasia , Yiannis Andreopoulos

Demystifying ARM SME to Optimize General Matrix Multiplications

General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures like ARM's Scalable Matrix Extension (SME) introduce dedicated hardware for matrix operations, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Chencheng Deng , Weiling Yang , Jianbin Fang , Dezun Dong

Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators

General matrix multiplication (GEMM) is a fundamental operation in deep learning (DL). With DL moving increasingly toward low precision, recent works have proposed novel unary GEMM designs as an alternative to conventional binary GEMM…

Hardware Architecture · Computer Science 2026-02-03 Prabhu Vellaisamy , Harideep Nair , Di Wu , Shawn Blanton , John Paul Shen

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang , Rabab Alomairy , Dali Wang , Zhuowei Gu , Qinglei Cao

IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use…

Machine Learning · Computer Science 2024-03-13 Zhanpeng Zeng , Karthikeyan Sankaralingam , Vikas Singh

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more…

Hardware Architecture · Computer Science 2024-04-18 Endri Taka , Dimitrios Gourounas , Andreas Gerstlauer , Diana Marculescu , Aman Arora

Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations

Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subroutine (BLAS) of interest to the high-performance computing…

Hardware Architecture · Computer Science 2023-05-31 Louis Ledoux , Marc Casas

Optimizing GEMM for Energy and Performance on Versal ACAP Architectures

General Matrix Multiplication (GEMM) is a fundamental operation in many scientific workloads, signal processing, and particularly deep learning. It is often a bottleneck for performance and energy efficiency, especially in edge environments…

Hardware Architecture · Computer Science 2025-11-11 Ilias Papalamprou , Dimosthenis Masouros , Ioannis Loudaros , Francky Catthoor , Dimitrios Soudris

MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Bo Fang , Xinyi Li , Harvey Dam , Cheng Tan , Siva Kumar Sastry Hari , Timothy Tsai , Ignacio Laguna , Dingwen Tao , Ganesh Gopalakrishnan , Prashant Nair , Kevin Barker , Ang Li

Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…

Hardware Architecture · Computer Science 2024-03-13 Cristian Ramírez , Adrián Castelló , Héctor Martínez , Enrique S. Quintana-Ortí

tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit

General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much…

Hardware Architecture · Computer Science 2024-12-25 Prabhu Vellaisamy , Harideep Nair , Joseph Finn , Manav Trivedi , Albert Chen , Anna Li , Tsung-Han Lin , Perry Wang , Shawn Blanton , John Paul Shen

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new…

Machine Learning · Computer Science 2018-10-09 Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Meghan Cowan , Haichen Shen , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

Precision-Energy-Throughput Scaling Of Generic Matrix Multiplication and Convolution Kernels Via Linear Projections

Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose…

Multimedia · Computer Science 2014-11-12 Mohammad Ashraful Anam , Paul N. Whatmough , Yiannis Andreopoulos

Toward matrix multiplication for deep learning inference on the Xilinx Versal

The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-16 Jie Lei , José Flich , Enrique S. Quintana-Ortí

Generating Families of Practical Fast Matrix Multiplication Algorithms

Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due…

Mathematical Software · Computer Science 2016-11-04 Jianyu Huang , Leslie Rice , Devin A. Matthews , Robert A. van de Geijn