Related papers: Accelerating Reduction and Scan Using Tensor Core …

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury , Francesco Silvestri , Flavio Vella

TCUDB: Accelerating Database with Tensor Processors

The emergence of novel hardware accelerators has powered the tremendous growth of machine learning in recent years. These accelerators deliver incomparable performance gains in processing high-volume matrix operators, particularly matrix…

Databases · Computer Science 2021-12-15 Yu-Ching Hu , Yuliang Li , Hung-Wei Tseng

Tensor Core Units (TCUs) are hardware accelerators developed for deep neural networks, which efficiently support the multiplication of two dense $\sqrt{m}\times \sqrt{m}$ matrices, where $m$ is a given hardware parameter. In this paper, we…

Data Structures and Algorithms · Computer Science 2020-06-24 Thomas D. Ahle , Francesco Silvestri

Large Scale Distributed Linear Algebra With Tensor Processing Units

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically…

Computational Physics · Physics 2022-09-14 Adam G. M. Lewis , Jackson Beall , Martin Ganahl , Markus Hauru , Shrestha Basu Mallick , Guifre Vidal

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco , Raimundo Vega , Cristóbal A. Navarro

cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores

Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix…

Performance · Computer Science 2025-11-25 Lizhi Xiang , Omid Asudeh , Gerald Sabin , Aravind Sukumaran-Rajam , P. Sadayappan

A Parallel Scan Algorithm in the Tensor Core Unit Model

We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size $s$ is a basic operation. In the $(s^2, \ell)$-TCU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-28 Anastasios Zouzias , William F. McColl

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM…

Mathematical Software · Computer Science 2020-10-01 Orestis Zachariadis , Nitin Satpute , Juan Gómez-Luna , Joaquín Olivares

DGEMM on Integer Matrix Multiplication Unit

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-02 Hiroyuki Ootomo , Katsuhisa Ozaki , Rio Yokota

Parallel Scan on Ascend AI Accelerators

We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-05 Bartłomiej Wróblewski , Gioele Gottardo , Anastasios Zouzias

Efficient Quantized Sparse Matrix Operations on Tensor Cores

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Shigang Li , Kazuki Osawa , Torsten Hoefler

NVIDIA Tensor Core Programmability, Performance & Precision

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-18 Stefano Markidis , Steven Wei Der Chien , Erwin Laure , Ivy Bo Peng , Jeffrey S. Vetter

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-19 Hiroyuki Ootomo , Rio Yokota

Tensor Slicing and Optimization for Multicore NPUs

Although code generation for Convolution Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrai\-ned Multicore Neural Processor Units (NPUs) is still a challenging…

Performance · Computer Science 2023-04-07 Rafael Sousa , Marcio Pereira , Yongin Kwon , Taeho Kim , Namsoon Jung , Chang Soo Kim , Michael Frank , Guido Araujo

Accurate Models of NVIDIA Tensor Cores

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…

Mathematical Software · Computer Science 2026-04-07 Faizan A. Khattak , Mantas Mikaitis

Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library

NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak performance is more than 300 TFlop/s on NVIDIA A100 GPU. NVIDIA provides WMMA API for using Tensor Cores in custom…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-30 Hiroyuki Ootomo , Rio Yokota

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science 2022-11-29 Wei Sun , Ang Li , Tong Geng , Sander Stuijk , Henk Corporaal

Accelerating Sparse Deep Neural Networks

As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero…

Machine Learning · Computer Science 2021-04-20 Asit Mishra , Jorge Albericio Latorre , Jeff Pool , Darko Stosic , Dusan Stosic , Ganesh Venkatesh , Chong Yu , Paulius Micikevicius

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that…

Hardware Architecture · Computer Science 2017-04-18 Norman P. Jouppi , Cliff Young , Nishant Patil , David Patterson , Gaurav Agrawal , Raminder Bajwa , Sarah Bates , Suresh Bhatia , Nan Boden , Al Borchers , Rick Boyle , Pierre-luc Cantin , Clifford Chao , Chris Clark , Jeremy Coriell , Mike Daley , Matt Dau , Jeffrey Dean , Ben Gelb , Tara Vazir Ghaemmaghami , Rajendra Gottipati , William Gulland , Robert Hagmann , C. Richard Ho , Doug Hogberg , John Hu , Robert Hundt , Dan Hurt , Julian Ibarz , Aaron Jaffey , Alek Jaworski , Alexander Kaplan , Harshit Khaitan , Andy Koch , Naveen Kumar , Steve Lacy , James Laudon , James Law , Diemthu Le , Chris Leary , Zhuyuan Liu , Kyle Lucke , Alan Lundin , Gordon MacKean , Adriana Maggiore , Maire Mahony , Kieran Miller , Rahul Nagarajan , Ravi Narayanaswami , Ray Ni , Kathy Nix , Thomas Norrie , Mark Omernick , Narayana Penukonda , Andy Phelps , Jonathan Ross , Matt Ross , Amir Salek , Emad Samadiani , Chris Severn , Gregory Sizikov , Matthew Snelham , Jed Souter , Dan Steinberg , Andy Swing , Mercedes Tan , Gregory Thorson , Bo Tian , Horia Toma , Erick Tuttle , Vijay Vasudevan , Richard Walter , Walter Wang , Eric Wilcox , Doe Hyun Yoon

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari , Mudabir Qamar Ansari