English
Related papers

Related papers: Task-Based Tensor Computations on Modern GPUs

200 papers

This study presents a comprehensive multi-level analysis of the NVIDIA Hopper GPU architecture, focusing on its performance characteristics and novel features. We benchmark Hopper's memory subsystem, highlighting improvements in the L2…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-05 Weile Luo , Ruibo Fan , Zeyu Li , Dayou Du , Hongyuan Liu , Qiang Wang , Xiaowen Chu

The efficacy of deep learning has resulted in its use in a growing number of applications. The Volta graphics processor unit (GPU) architecture from NVIDIA introduced a specialized functional unit, the "tensor core", that helps meet the…

Mathematical Software · Computer Science 2019-02-22 Md Aamir Raihan , Negar Goli , Tor Aamodt

To achieve peak performance on modern GPUs, one must balance two frames of mind: issuing instructions to individual threads to control their behavior, while simultaneously tracking the convergence of many threads acting in concert to…

Programming Languages · Computer Science 2025-11-18 Manya Bansal , Daniel Sainati , Joseph W. Cutler , Saman Amarasinghe , Jonathan Ragan-Kelley

We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the…

Machine Learning · Computer Science 2023-12-20 Ganesh Bikshandi , Jay Shah

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-18 Stefano Markidis , Steven Wei Der Chien , Erwin Laure , Ivy Bo Peng , Jeffrey S. Vetter

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A…

Hardware Architecture · Computer Science 2024-02-22 Weile Luo , Ruibo Fan , Zeyu Li , Dayou Du , Qiang Wang , Xiaowen Chu

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML…

Hardware Architecture · Computer Science 2024-07-12 Mohammed Elbtity , Peyton Chandarana , Ramtin Zand

In this paper, we explore the acceleration of tensor product operations in finite element methods, leveraging the computational power of the NVIDIA A100 GPU Tensor Cores. We provide an accessible overview of the necessary mathematical…

Mathematical Software · Computer Science 2024-07-16 Cu Cui

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that…

As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop…

Machine Learning · Computer Science 2023-02-16 Yaoyao Ding , Cody Hao Yu , Bojian Zheng , Yizhi Liu , Yida Wang , Gennady Pekhimenko

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-17 Tsung-Wei Huang , Yibo Lin

Recently, tensor algebra have witnessed significant applications across various domains. Each operator in tensor algebra features different computational workload and precision. However, current general accelerators, such as VPU, GPGPU, and…

Hardware Architecture · Computer Science 2024-05-06 Chenyang Ai , Lechuan Zhao , Zhijie Huang , Cangyuan Li , Xinan Wang , Ying Wang

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for…

Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads…

Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-02 Aleix Boné , Alejandro Aguirre , David Álvarez , Pedro J. Martinez-Ferrer , Vicenç Beltran

With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption…

Hardware Architecture · Computer Science 2025-03-04 Zhantong Zhu , Hongou Li , Wenjie Ren , Meng Wu , Le Ye , Ru Huang , Tianyu Jia

The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco , Raimundo Vega , Cristóbal A. Navarro

We present novel algorithmic solutions together with implementation details utilizing non-Abelian symmetries in order to boost the current limits of tensor network state algorithms on high performance computing infrastructure. In our…

Computational Physics · Physics 2023-10-02 Andor Menczer , Örs Legeza

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science 2022-11-29 Wei Sun , Ang Li , Tong Geng , Sander Stuijk , Henk Corporaal

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science 2021-10-26 Quentin Gallouédec
‹ Prev 1 2 3 10 Next ›