English
Related papers

Related papers: Modeling Deep Learning Accelerator Enabled GPUs

200 papers

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called "Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-18 Stefano Markidis , Steven Wei Der Chien , Erwin Laure , Ivy Bo Peng , Jeffrey S. Vetter

The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco , Raimundo Vega , Cristóbal A. Navarro

This paper explores the impact of simulator accuracy on architecture design decisions in the general-purpose graphics processing unit (GPGPU) space. We perform a detailed, quantitative analysis of the most popular publicly available GPU…

Hardware Architecture · Computer Science 2020-06-04 Mahmoud Khairy , Jain Akshay , Tor Aamodt , Timothy G. Rogers

General Purpose Graphics Processing Unit (GPGPU) computing plays a transformative role in deep learning and machine learning by leveraging the computational advantages of parallel processing. Through the power of Compute Unified Device…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-20 Ming Li , Ziqian Bi , Tianyang Wang , Yizhu Wen , Qian Niu , Xinyuan Song , Zekun Jiang , Junyu Liu , Benji Peng , Sen Zhang , Xuanhe Pan , Jiawei Xu , Jinlang Wang , Keyu Chen , Caitlyn Heqi Yin , Pohsun Feng , Ming Liu

Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch. This paper describes changes we made to the GPGPU-Sim simulator to enable it to run PyTorch by running PTX kernels…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-29 Jonathan Lew , Deval Shah , Suchita Pati , Shaylin Cattell , Mengchi Zhang , Amruth Sandhupatla , Christopher Ng , Negar Goli , Matthew D. Sinclair , Timothy G. Rogers , Tor Aamodt

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for…

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science 2022-11-29 Wei Sun , Ang Li , Tong Geng , Sander Stuijk , Henk Corporaal

Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A…

Hardware Architecture · Computer Science 2024-02-22 Weile Luo , Ruibo Fan , Zeyu Li , Dayou Du , Qiang Wang , Xiaowen Chu

Recent studies from several hyperscalars pinpoint to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of…

Machine Learning · Computer Science 2019-08-27 Youngeun Kwon , Yunjae Lee , Minsoo Rhu

With the rapid development of in-depth learning, neural network and deep learning algorithms have been widely used in various fields, e.g., image, video and voice processing. However, the neural network model is getting larger and larger,…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-30 Teng Wang , Chao Wang , Xuehai Zhou , Huaping Chen

Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Lingqi Zhang , Jiajun Huang , Sheng Di , Satoshi Matsuoka , Mohamed Wahib

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury , Francesco Silvestri , Flavio Vella

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

Many research works have been performed on implementation of Vitrerbi decoding algorithm on GPU instead of FPGA because this platform provides considerable flexibility in addition to great performance. Recently, the recently-introduced…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-30 Alireza Mohammadidoost , Matin Hashemi

In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-19 Zhe Jia , Marco Maggioni , Jeffrey Smith , Daniele Paolo Scarpazza

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural networks models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly…

Machine Learning · Computer Science 2022-10-25 xiangyang Ju , Yunsong Wang , Daniel Murnane , Nicholas Choma , Steven Farrell , Paolo Calafiura

Many recent computational accelerators provide non-standard (e.g., reduced precision) arithmetic operations to enhance performance for floating-point matrix multiplication. Unfortunately, the properties of these accelerators are not widely…

Hardware Architecture · Computer Science 2025-02-25 Benjamin Valpey , Xinyi Li , Sreepathi Pai , Ganesh Gopalakrishnan

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-04-19 Zhe Jia , Marco Maggioni , Benjamin Staiger , Daniele P. Scarpazza

Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper aims to explore TPUs in cloud and edge computing focusing on its applications in AI. We provide an overview of TPUs,…

Hardware Architecture · Computer Science 2023-11-15 Diego Sanmartín Carrión , Vera Prohaska

We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance tensor transposes for NVIDIA GPUs with Kepler and above architectures. cuTT achieves high performance by (a) utilizing two GPU-optimized transpose…

Mathematical Software · Computer Science 2017-05-05 Antti-Pekka Hynninen , Dmitry I. Lyakh
‹ Prev 1 2 3 10 Next ›