Related papers: DGEMM on Integer Matrix Multiplication Unit

200 papers

As the demand for AI computation rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable…

Performance · Computer Science 2025-09-26 Daichi Mukunoki

In this paper, we propose a method for emulating double-precision general matrix-matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many high-performance computing applications. Ozaki-I and Ozaki-II are…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-07 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-29 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware…

Mathematical Software · Computer Science 2025-04-29 Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura

Modern computing architectures feature low-precision matrix multiplication units that achieve substantially higher throughput than their high-precision counterparts. Motivated by this architectural trend, the emulation of high-precision…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Yuki Uchino, Qianxiang Ma, Toshiyuki Imamura, Katsuhisa Ozaki, Patrick Lars Gutsche

The Ozaki-II scheme is an emulation method that leverages the Chinese Remainder Theorem to compute high-precision matrix multiplication via a sequence of low-precision matrix multiplications. In this scheme, the attainable numerical…

Numerical Analysis · Mathematics 2026-02-04 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science 2019-11-26 Abdul Dakkak, Cheng Li, Isaac Gelado, Jinjun Xiong, Wen-mei Hwu

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-13 Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-19 Angelika Schwarz, Anton Anders, Cole Brower, Harun Bayraktar, John Gunnels, Kate Clark, RuQing G. Xu, Samuel Rodriguez, Sebastien Cayrols, Paweł Tabaszewski, Victor Podlozhnyuk

Matrix multiplication is the bedrock of deep learning inference applications. When it comes to hardware acceleration on edge computing devices, matrix multiplication often takes up a great majority of the time. To achieve better performance…

Machine Learning · Computer Science 2021-10-12 Yuyang Zhang, Dik Hin Leung, Min Guo, Yijia Xiao, Haoyue Liu, Yunfei Li, Jiyuan Zhang, Guan Wang, Zhen Chen

Optimized multiple precision basic linear computation, especially matrix multiplication, is crucial for solving ill-conditioned problems. The recently proposed Ozaki scheme, which implements accurate matrix multiplication using existing…

Numerical Analysis · Mathematics 2023-01-26 Taiga Utsugiri, Tomonori Kouya

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICIs), physically…

Computational Physics · Physics 2022-09-14 Adam G. M. Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha Basu Mallick, Guifre Vidal

Efficient multiple precision linear numerical computation libraries such as MPLAPACK are critical in dealing with ill-conditioned problems. Specifically, there are optimization methods for matrix multiplication, such as the Strassen…

Numerical Analysis · Mathematics 2023-07-13 Tomonori Kouya

In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 S.-Kazem Shekofteh, Christian Alles, Nils Kochendörfer, Holger Fröning

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…

Mathematical Software · Computer Science 2026-04-07 Faizan A. Khattak, Mantas Mikaitis

Deep learning training involves a large number of operations, which are dominated by high dimensionality Matrix-Vector Multiplies (MVMs). This has motivated hardware accelerators to enhance compute efficiency, but where data movement and…

Systems and Control · Electrical Eng. & Systems 2022-07-07 Christopher Grimm, Naveen Verma

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury, Francesco Silvestri, Flavio Vella

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Shigang Li, Kazuki Osawa, Torsten Hoefler

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-13 Bo Fang, Xinyi Li, Harvey Dam, Cheng Tan, Siva Kumar Sastry Hari, Timothy Tsai, Ignacio Laguna, Dingwen Tao, Ganesh Gopalakrishnan, Prashant Nair, Kevin Barker, Ang Li

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-21 Qiao Zhang, Rabab Alomairy, Dali Wang, Zhuowei Gu, Qinglei Cao