Related papers: GPU Tensor Cores for fast Arithmetic Reductions
The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…
Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…
Tensor cores (TCs) are a type of Application-Specific Integrated Circuit (ASIC) and are a recent addition to Graphics Processing Unit (GPU) architectures. As such, TCs are purposefully designed to greatly improve the performance of Matrix…
Reduction operations are extensively employed in many computational problems. A reduction consists of, given a finite set of numeric elements, combining into a single value all elements in that set, using for this a combiner function. A…
Convolution is one of the fundamental operations of deep neural networks with demanding matrix computation. In a graphic processing unit (GPU), Tensor Core is a specialized matrix processing hardware equipped with reduced-precision…
This work presents a GPU thread mapping approach that allows doing fast parallel stencil-like computations on discrete fractals using their compact representation. The intuition behind is to employ two GPU tensor-core accelerated thread…
Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success,…
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…
Tensor decomposition has been widely used in machine learning and high-volume data analysis. However, large-scale tensor factorization often consumes huge memory and computing cost. Meanwhile, modernized computing hardware such as tensor…
Modern GPUs are equipped with tensor cores (TCs) that are commonly used for matrix multiplication in artificial intelligence workloads. However, because they have high computational throughput, they can lead to significant performance gains…
In this paper, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical…
NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak performance is more than 300 TFlop/s on NVIDIA A100 GPU. NVIDIA provides WMMA API for using Tensor Cores in custom…
The recent trend of using Graphics Processing Units (GPU's) for high performance computations is driven by the high ratio of price performance for these units, complemented by their cost effectiveness. At first glance, computational fluid…
Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…
Graph neural networks (GNNs) have seen extensive application in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which…
Efficient simulation of quantum circuits has become indispensable with the rapid development of quantum hardware. The primary simulation methods are based on state vectors and tensor networks. As the number of qubits and quantum gates grows…
AI models are increasing in size and recent advancement in the community has shown that unlike HPC applications where double precision datatype are required, lower-precision datatypes such as fp8 or int4 are sufficient to bring the same…
We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…
Many research works have been performed on implementation of Vitrerbi decoding algorithm on GPU instead of FPGA because this platform provides considerable flexibility in addition to great performance. Recently, the recently-introduced…