English
Related papers

Related papers: NVIDIA Tensor Core Programmability, Performance & …

200 papers

Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-19 Hiroyuki Ootomo , Rio Yokota

NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak performance is more than 300 TFlop/s on NVIDIA A100 GPU. NVIDIA provides WMMA API for using Tensor Cores in custom…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-30 Hiroyuki Ootomo , Rio Yokota

In this paper, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical…

Mathematical Software · Computer Science 2024-07-16 Cu Cui

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science 2022-11-29 Wei Sun , Ang Li , Tong Geng , Sander Stuijk , Henk Corporaal

Many recent computational accelerators provide non-standard (e.g., reduced precision) arithmetic operations to enhance performance for floating-point matrix multiplication. Unfortunately, the properties of these accelerators are not widely…

Hardware Architecture · Computer Science 2025-02-25 Benjamin Valpey , Xinyi Li , Sreepathi Pai , Ganesh Gopalakrishnan

The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco , Raimundo Vega , Cristóbal A. Navarro

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science 2021-10-26 Quentin Gallouédec

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…

Mathematical Software · Computer Science 2026-04-07 Faizan A. Khattak , Mantas Mikaitis

The efficacy of deep learning has resulted in its use in a growing number of applications. The Volta graphics processor unit (GPU) architecture from NVIDIA introduced a specialized functional unit, the "tensor core", that helps meet the…

Mathematical Software · Computer Science 2019-02-22 Md Aamir Raihan , Negar Goli , Tor Aamodt

Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-26 Binrui Li , Shenggan Cheng , James Lin

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-17 Cristóbal A. Navarro , Roberto Carrasco , Ricardo J. Barrientos , Javier A. Riquelme , Raimundo Vega

In recent years, a new kind of accelerated hardware has gained popularity in the Artificial Intelligence (AI) and Machine Learning (ML) communities which enables extremely high-performance tensor contractions in reduced precision for deep…

Computational Physics · Physics 2024-05-01 Adela Habib , Joshua Finkelstein , Anders M. N. Niklasson

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science 2019-11-26 Abdul Dakkak , Cheng Li , Isaac Gelado , Jinjun Xiong , Wen-mei Hwu

Random projection can reduce the dimension of data while capturing its structure and is a fundamental tool for machine learning, signal processing, and information retrieval, which deal with a large amount of data today. RandNLA (Randomized…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-11 Hiroyuki Ootomo , Rio Yokota

NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not…

Hardware Architecture · Computer Science 2024-03-04 Xinyi Li , Ang Li , Bo Fang , Katarzyna Swirydowicz , Ignacio Laguna , Ganesh Gopalakrishnan

This study presents a comprehensive multi-level analysis of the NVIDIA Hopper GPU architecture, focusing on its performance characteristics and novel features. We benchmark Hopper's memory subsystem, highlighting improvements in the L2…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Weile Luo , Ruibo Fan , Zeyu Li , Dayou Du , Hongyuan Liu , Qiang Wang , Xiaowen Chu

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic…

Hardware Architecture · Computer Science 2026-04-07 Nikhil Rout , Blaise Tine

Driven by the insatiable needs to process ever larger amount of data with more complex models, modern computer processors and accelerators are beginning to offer half precision floating point arithmetic support, and extremely optimized…

Mathematical Software · Computer Science 2019-12-12 Shaoshuai Zhang , Panruo Wu

Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Lingqi Zhang , Jiajun Huang , Sheng Di , Satoshi Matsuoka , Mohamed Wahib

Recently, tensor algebra have witnessed significant applications across various domains. Each operator in tensor algebra features different computational workload and precision. However, current general accelerators, such as VPU, GPGPU, and…

Hardware Architecture · Computer Science 2024-05-06 Chenyang Ai , Lechuan Zhao , Zhijie Huang , Cangyuan Li , Xinan Wang , Ying Wang
‹ Prev 1 2 3 10 Next ›