Related papers: GPU Tensor Cores for fast Arithmetic Reductions

200 papers

The NVIDIA GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-03-12 · Roberto Carrasco, Raimundo Vega, Cristóbal A. Navarro

Tensor Cores have been an important unit for accelerating Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since the Volta architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science · 2022-11-29 · Wei Sun, Ang Li, Tong Geng, Sander Stuijk, Henk Corporaal

The NVIDIA Volta GPU microarchitecture introduces a specialized unit called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-12-18 · Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, Jeffrey S. Vetter

Tensor cores (TCs) are a type of Application-Specific Integrated Circuit (ASIC) and are a recent addition to Graphics Processing Unit (GPU) architectures. As such, TCs are purposefully designed to greatly improve the performance of Matrix…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-11-21 · Benoit Gallet, Michael Gowanlock

Reduction operations are extensively employed in many computational problems. Given a finite set of numeric elements, a reduction combines all elements of that set into a single value using a combiner function. A…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-10-23 · Walid Jradi, Hugo do Nascimento, Wellington Martins

Convolution is one of the fundamental operations of deep neural networks and involves demanding matrix computation. In a graphics processing unit (GPU), the Tensor Core is specialized matrix processing hardware equipped with reduced-precision…

Machine Learning · Computer Science · 2022-02-25 · Junkyeong Choi, Hyucksung Kwon, Woongkyu Lee, Jungwook Choi, Jieun Lim

This work presents a GPU thread mapping approach that allows fast parallel stencil-like computations on discrete fractals using their compact representation. The underlying idea is to employ two GPU tensor-core accelerated thread…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-10-26 · Felipe A. Quezada, Cristóbal A. Navarro

Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-04 · Lingqi Zhang, Jiajun Huang, Sheng Di, Satoshi Matsuoka, Mohamed Wahib

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science · 2019-11-26 · Abdul Dakkak, Cheng Li, Isaac Gelado, Jinjun Xiong, Wen-mei Hwu

Tensor decomposition has been widely used in machine learning and high-volume data analysis. However, large-scale tensor factorization often consumes huge memory and computing cost. Meanwhile, modernized computing hardware such as tensor…

Optimization and Control · Mathematics · 2022-09-12 · Zi Yang, Junnan Shan, Zheng Zhang

Modern GPUs are equipped with tensor cores (TCs) that are commonly used for matrix multiplication in artificial intelligence workloads. However, because they have high computational throughput, they can lead to significant performance gains…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-09-01 · Brian Curless, Michael Gowanlock

In this paper, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical…

Mathematical Software · Computer Science · 2024-07-16 · Cu Cui

NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak performance is more than 300 TFlop/s on NVIDIA A100 GPU. NVIDIA provides WMMA API for using Tensor Cores in custom…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-08-30 · Hiroyuki Ootomo, Rio Yokota

The recent trend of using Graphics Processing Units (GPUs) for high performance computations is driven by the high price-performance ratio of these units, complemented by their cost effectiveness. At first glance, computational fluid…

Computational Engineering, Finance, and Science · Computer Science · 2018-02-13 · Kiril S. Shterev

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science · 2021-10-26 · Quentin Gallouédec

Graph neural networks (GNNs) have seen extensive application in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which…

Machine Learning · Computer Science · 2025-02-25 · Ka Wai Wu

Efficient simulation of quantum circuits has become indispensable with the rapid development of quantum hardware. The primary simulation methods are based on state vectors and tensor networks. As the number of qubits and quantum gates grows…

Quantum Physics · Physics · 2024-08-13 · Feng Pan, Hanfeng Gu, Lvlin Kuang, Bing Liu, Pan Zhang

AI models are increasing in size, and recent advances in the community have shown that, unlike HPC applications where double-precision datatypes are required, lower-precision datatypes such as fp8 or int4 are sufficient to bring the same…

Performance · Computer Science · 2023-10-11 · Saeed Maleki

We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…

Numerical Analysis · Mathematics · 2021-09-08 · Hussam Al Daas, Grey Ballard, Peter Benner

Many research works have implemented the Viterbi decoding algorithm on GPUs instead of FPGAs because this platform provides considerable flexibility in addition to great performance. Recently, the recently-introduced…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-11-30 · Alireza Mohammadidoost, Matin Hashemi