Related papers: Optimizing Bit-Serial Matrix Multiplication for Re…

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix…

Hardware Architecture · Computer Science 2018-06-26 Yaman Umuroglu, Lahiru Rasnayake, Magnus Sjalander

Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-26 Johannes de Fine Licht, Grzegorz Kwasniewski, Torsten Hoefler

bitSMM: A bit-Serial Matrix Multiplication Accelerator

Neural-network (NN) inference is increasingly present on-board spacecraft to reduce downlink bandwidth and enable timely decision making. However, the power and reliability constraints of space missions limit the applicability of many…

Hardware Architecture · Computer Science 2026-03-17 Pedro Antunes, Artur Podobas

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Generic and Universal Parallel Matrix Summation with a Flexible Compression Goal for Xilinx FPGAs

Bit matrix compression is a highly relevant operation in computer arithmetic. Essentially being a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the…

Hardware Architecture · Computer Science 2018-06-22 Thomas B. Preußer

Binary matrix factorization on special purpose hardware

Many fundamental problems in data mining can be reduced to one or more NP-hard combinatorial optimization problems. Recent advances in novel technologies such as quantum and quantum-inspired hardware promise a substantial speedup for…

Machine Learning · Computer Science 2022-01-10 Osman Asif Malik, Hayato Ushijima-Mwesigwa, Arnab Roy, Avradip Mandal, Indradeep Ghosh

CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU…

Hardware Architecture · Computer Science 2026-04-14 Jinpeng Ye, Chongxi Wang, Wenqing Li, Bin Yuan, Shiyi Wang, Fenglu Zhang, Junyu Yue, Jianan Xie, Yunhao Ye, Haoyu Deng, Yingkun Zhou, Xin Cheng, Fuxin Zhang, Jian Wang

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao

Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which…

Mathematical Software · Computer Science 2020-06-25 Dominik Ernst, Georg Hager, Jonas Thies, Gerhard Wellein

Run-Time-Reconfigurable Multi-Precision Floating-Point Matrix Multiplier Intellectual Property Core on FPGA

In today's world, high-power computing applications such as image processing, digital signal processing, graphics, and robotics require enormous computing power. These applications use matrix operations, especially matrix multiplication.…

Hardware Architecture · Computer Science 2019-10-29 Arish S, R. K. Sharma

High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands are likely to become a fundamental kernel of many important workloads including neural networks and machine learning. While existing SIMD…

Machine Learning · Computer Science 2020-08-04 Dibakar Gope, Jesse Beu, Matthew Mattina

BLISlab: A Sandbox for Optimizing GEMM

Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course yet in practice…

Mathematical Software · Computer Science 2016-09-02 Jianyu Huang, Robert A. van de Geijn

Supporting mixed-datatype matrix multiplication within the BLIS framework

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or…

Mathematical Software · Computer Science 2019-05-03 Field G. Van Zee, Devangi N. Parikh, Robert A. van de Geijn

Quantum Computing for MIMO Beam Selection Problem: Model and Optical Experimental Solution

Massive multiple-input multiple-output (MIMO) has gained widespread popularity in recent years due to its ability to increase data rates, improve signal quality, and provide better coverage in challenging environments. In this paper, we…

Networking and Internet Architecture · Computer Science 2023-10-31 Yuhong Huang, Wenxin Li, Chengkang Pan, Shuai Hou, Xian Lu, Chunfeng Cui, Jingwei Wen, Jiaqi Xu, Chongyu Cao, Yin Ma, Hai Wei, Kai Wen

Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the…

Hardware Architecture · Computer Science 2024-03-13 Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí

An efficient algorithm for multiuser sum-rate maximization of large-scale active RIS-aided MIMO system

Active reconfigurable intelligent surface (RIS) is a new RIS architecture that can reflect and amplify communication signals. It can provide enhanced performance gain compared to the conventional passive RIS systems that can only reflect…

Information Theory · Computer Science 2024-01-12 Qian Zhang, Mingjie Shao, Qiang Li, Ju Liu

Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs

General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or…

Hardware Architecture · Computer Science 2025-03-11 Qizhe Wu, Huawen Liang, Yuchen Gui, Zhichen Zeng, Zerong He, Linfeng Tao, Xiaotian Wang, Letian Zhao, Zhaoxi Zeng, Wei Yuan, Wei Wu, Xi Jin

Effect of Mixed Precision Computing on H-Matrix Vector Multiplication in BEM Analysis

Hierarchical Matrix (H-matrix) is an approximation technique which splits a target dense matrix into multiple submatrices, and where a selected portion of submatrices are low-rank approximated. The technique substantially reduces both time…

Mathematical Software · Computer Science 2019-11-04 Rise Ooi, Takeshi Iwashita, Takeshi Fukaya, Akihiro Ida, Rio Yokota

UPMEM Unleashed: Software Secrets for Speed

Developing kernels for Processing-In-Memory (PIM) platforms poses unique challenges in data management and parallel programming on limited processing units. Although software development kits (SDKs) for PIM, such as the UPMEM SDK, provide…

Hardware Architecture · Computer Science 2025-10-21 Krystian Chmielewski, Jarosław Ławnicki, Uladzislau Lukyanau, Tadeusz Kobus, Maciej Maciejewski

LightMat-HP: A Photonic-Electronic System for Accelerating General Matrix Multiplication With Configurable Precision

Matrix multiplication is a fundamental kernel in large-scale artificial intelligence and scientific computing, but its performance on conventional electronic accelerators is increasingly constrained by memory bandwidth and energy…

Emerging Technologies · Computer Science 2026-04-15 Hailong Gong, Haibo Zhang, Amanda S. Barnard, Mahbub Hassan, Matt Woolley, Rajkumar Buyya