Related papers: DGEMM on Integer Matrix Multiplication Unit

DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

As the demand for AI computation rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable…

Performance · Computer Science 2025-09-26 Daichi Mukunoki

Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

In this paper, we propose a method for emulating double-precision general matrix-matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many high-performance computing applications. Ozaki-I and Ozaki-II are…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-07 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit

This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-29 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique

This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware…

Mathematical Software · Computer Science 2025-04-29 Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura

Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem

Modern computing architectures feature low-precision matrix multiplication units that achieve substantially higher throughput than their high-precision counterparts. Motivated by this architectural trend, the emulation of high-precision…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Yuki Uchino, Qianxiang Ma, Toshiyuki Imamura, Katsuhisa Ozaki, Patrick Lars Gutsche

Error Analysis of Matrix Multiplication Emulation Using Ozaki-II Scheme

The Ozaki-II scheme is an emulation method that leverages the Chinese Remainder Theorem to compute high-precision matrix multiplication via a sequence of low-precision matrix multiplications. In this scheme, the attainable numerical…

Numerical Analysis · Mathematics 2026-02-04 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science 2019-11-26 Abdul Dakkak, Cheng Li, Isaac Gelado, Jinjun Xiong, Wen-mei Hwu

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-13 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme

The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-19 Angelika Schwarz, Anton Anders, Cole Brower, Harun Bayraktar, John Gunnels, Kate Clark, RuQing G. Xu, Samuel Rodriguez, Sebastien Cayrols, Paweł Tabaszewski, Victor Podlozhnyuk

A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization

Matrix multiplication is the bedrock in Deep Learning inference application. When it comes to hardware acceleration on edge computing devices, matrix multiplication often takes up a great majority of the time. To achieve better performance…

Machine Learning · Computer Science 2021-10-12 Yuyang Zhang, Dik Hin Leung, Min Guo, Yijia Xiao, Haoyue Liu, Yunfei Li, Jiyuan Zhang, Guan Wang, Zhen Chen

Acceleration of Multiple Precision Matrix Multiplication using Ozaki scheme

Optimized multiple precision basic linear computation, especially matrix multiplication, is crucial for solving ill-conditioned problems. The recently proposed Ozaki scheme, which implements accurate matrix multiplication using existing…

Numerical Analysis · Mathematics 2023-01-26 Taiga Utsugiri, Tomonori Kouya

Large Scale Distributed Linear Algebra With Tensor Processing Units

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically…

Computational Physics · Physics 2022-09-14 Adam G. M. Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha Basu Mallick, Guifre Vidal

Acceleration of complex matrix multiplication using arbitrary precision floating-point arithmetic

Efficient multiple precision linear numerical computation libraries such as MPLAPACK are critical in dealing with ill-conditioned problems. Specifically, there are optimization methods for matrix multiplication, such as the Strassen…

Numerical Analysis · Mathematics 2023-07-13 Tomonori Kouya

On Performance Analysis of Graphcore IPUs: Analyzing Squared and Skewed Matrix Multiplication

In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 S.-Kazem Shekofteh, Christian Alles, Nils Kochendörfer, Holger Fröning

Accurate Models of NVIDIA Tensor Cores

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…

Mathematical Software · Computer Science 2026-04-07 Faizan A. Khattak, Mantas Mikaitis

Neural Network Training on In-memory-computing Hardware with Radix-4 Gradients

Deep learning training involves a large number of operations, which are dominated by high dimensionality Matrix-Vector Multiplies (MVMs). This has motivated hardware accelerators to enhance compute efficiency, but where data movement and…

Systems and Control · Electrical Eng. & Systems 2022-07-07 Christopher Grimm, Naveen Verma

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury, Francesco Silvestri, Flavio Vella

Efficient Quantized Sparse Matrix Operations on Tensor Cores

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Shigang Li, Kazuki Osawa, Torsten Hoefler

MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Bo Fang, Xinyi Li, Harvey Dam, Cheng Tan, Siva Kumar Sastry Hari, Timothy Tsai, Ignacio Laguna, Dingwen Tao, Ganesh Gopalakrishnan, Prashant Nair, Kevin Barker, Ang Li

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao