Related papers: Compiler-Level Matrix Multiplication Optimization …

200 papers

Quantization has emerged to be an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite having lower numerical precision, quantized DNNs are able to reduce both memory…

Machine Learning · Computer Science 2019-11-15 Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, Abe Taha

The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-15 Yufan Xia, Marco De La Pierre, Amanda S. Barnard, Giuseppe Maria Junior Barca

The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML)…

Machine Learning · Computer Science 2025-07-08 Samira Ahmadifarsani, Daniel Mueller-Gritschneder, Ulf Schlichtmann

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-03 Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM,…

Performance · Computer Science 2025-11-25 Alfredo Metere

The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based…

Mathematical Software · Computer Science 2015-05-30 Davide Anastasia, Yiannis Andreopoulos

General Matrix Multiplication (GEMM) is a critical kernel in high-performance computing and deep learning. While modern architectures like ARM's Scalable Matrix Extension (SME) introduce dedicated hardware for matrix operations, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Chencheng Deng, Weiling Yang, Jianbin Fang, Dezun Dong

General matrix multiplication (GEMM) is a fundamental operation in deep learning (DL). With DL moving increasingly toward low precision, recent works have proposed novel unary GEMM designs as an alternative to conventional binary GEMM…

Hardware Architecture · Computer Science 2026-02-03 Prabhu Vellaisamy, Harideep Nair, Di Wu, Shawn Blanton, John Paul Shen

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao

GEneral Matrix Multiply (GEMM) is a central operation in deep learning and corresponds to the largest chunk of the compute footprint. Therefore, improving its efficiency is an active topic of ongoing research. A popular strategy is the use…

Machine Learning · Computer Science 2024-03-13 Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more…

Hardware Architecture · Computer Science 2024-04-18 Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora

Many scientific computing problems can be reduced to Matrix-Matrix Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the Basic Linear Algebra Subroutine (BLAS) of interest to the high-performance computing…

Hardware Architecture · Computer Science 2023-05-31 Louis Ledoux, Marc Casas

General Matrix Multiplication (GEMM) is a fundamental operation in many scientific workloads, signal processing, and particularly deep learning. It is often a bottleneck for performance and energy efficiency, especially in edge environments…

Hardware Architecture · Computer Science 2025-11-11 Ilias Papalamprou, Dimosthenis Masouros, Ioannis Loudaros, Francky Catthoor, Dimitrios Soudris

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Bo Fang, Xinyi Li, Harvey Dam, Cheng Tan, Siva Kumar Sastry Hari, Timothy Tsai, Ignacio Laguna, Dingwen Tao, Ganesh Gopalakrishnan, Prashant Nair, Kevin Barker, Ang Li

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…

Hardware Architecture · Computer Science 2024-03-13 Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí

General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much…

Hardware Architecture · Computer Science 2024-12-25 Prabhu Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new…

Generic matrix multiplication (GEMM) and one-dimensional convolution/cross-correlation (CONV) kernels often constitute the bulk of the compute- and memory-intensive processing within image/audio recognition and matching systems. We propose…

Multimedia · Computer Science 2014-11-12 Mohammad Ashraful Anam, Paul N. Whatmough, Yiannis Andreopoulos

The remarkable positive impact of Deep Neural Networks on many Artificial Intelligence (AI) tasks has led to the development of various high performance algorithms as well as specialized processors and accelerators. In this paper we address…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-16 Jie Lei, José Flich, Enrique S. Quintana-Ortí

Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due…

Mathematical Software · Computer Science 2016-11-04 Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn