Related papers: TCUDB: Accelerating Database with Tensor Processor…

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science 2019-11-26 Abdul Dakkak, Cheng Li, Isaac Gelado, Jinjun Xiong, Wen-mei Hwu

RTCUDB: Building Databases with RT Processors

A spectrum of new hardware has been studied to accelerate database systems in the past decade. Specifically, CUDA cores are known to benefit from the fast development of GPUs and make notable performance improvements. The state-of-the-art…

Databases · Computer Science 2024-12-16 Xuri Shi, Kai Zhang, X. Sean Wang, Xiaodong Zhang, Rubao Lee

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury, Francesco Silvestri, Flavio Vella

cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores

Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix…

Performance · Computer Science 2025-11-25 Lizhi Xiang, Omid Asudeh, Gerald Sabin, Aravind Sukumaran-Rajam, P. Sadayappan

Query Processing on Tensor Computation Runtimes

The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by…

Databases · Computer Science 2023-02-13 Dong He, Supun Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, Matteo Interlandi

State-of-the-Art on Query & Transaction Processing Acceleration

The vast amount of processing power and memory bandwidth provided by modern Graphics Processing Units (GPUs) make them a platform for data-intensive applications. The database community identified GPUs as effective co-processors for data…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-02 Bernd Amann, Youry Khmelevsky, Gaetan Hains

The Tensor Data Platform: Towards an AI-centric Database System

Database engines have historically absorbed many of the innovations in data processing, adding features to process graph data, XML, object oriented, and text among many others. In this paper, we make the case that it is time to do the same…

Databases · Computer Science 2022-11-21 Apurva Gandhi, Yuki Asada, Victor Fu, Advitya Gemawat, Lihao Zhang, Rathijit Sen, Carlo Curino, Jesús Camacho-Rodríguez, Matteo Interlandi

Efficient Quantized Sparse Matrix Operations on Tensor Cores

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Shigang Li, Kazuki Osawa, Torsten Hoefler

Tensor Core Units (TCUs) are hardware accelerators developed for deep neural networks, which efficiently support the multiplication of two dense $\sqrt{m}\times \sqrt{m}$ matrices, where $m$ is a given hardware parameter. In this paper, we…

Data Structures and Algorithms · Computer Science 2020-06-24 Thomas D. Ahle, Francesco Silvestri

A Comprehensive Overview of GPU Accelerated Databases

Over the past decade, the landscape of data analytics has seen a notable shift towards heterogeneous architectures, particularly the integration of GPUs to enhance overall performance. In the realm of in-memory analytics, which often…

Databases · Computer Science 2024-06-21 Harshit Sharma, Anmol Sharma

Accurate Models of NVIDIA Tensor Cores

Matrix multiplication is a fundamental operation in both training of neural networks and inference. To accelerate matrix multiplication, Graphical Processing Units (GPUs) provide it implemented in hardware. Due to the increased throughput…

Mathematical Software · Computer Science 2026-04-07 Faizan A. Khattak, Mantas Mikaitis

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Large Scale Distributed Linear Algebra With Tensor Processing Units

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically…

Computational Physics · Physics 2022-09-14 Adam G. M. Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha Basu Mallick, Guifre Vidal

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that…

Hardware Architecture · Computer Science 2017-04-18 Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-04 Lingqi Zhang, Jiajun Huang, Sheng Di, Satoshi Matsuoka, Mohamed Wahib

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM…

Mathematical Software · Computer Science 2020-10-01 Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs

Despite foreseeing tremendous speedups over conventional deep neural networks, the performance advantage of binarized neural networks (BNNs) has merely been showcased on general-purpose processors such as CPUs and GPUs. In fact, due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-16 Ang Li, Simon Su

Accelerating Sparse Graph Neural Networks with Tensor Core Optimization

Graph neural networks (GNNs) have seen extensive application in domains such as social networks, bioinformatics, and recommendation systems. However, the irregularity and sparsity of graph data challenge traditional computing methods, which…

Machine Learning · Computer Science 2025-02-25 Ka Wai Wu

GTA: a new General Tensor Accelerator with Better Area Efficiency and Data Reuse

Recently, tensor algebra has witnessed significant applications across various domains. Each operator in tensor algebra features different computational workload and precision. However, current general accelerators, such as VPU, GPGPU, and…

Hardware Architecture · Computer Science 2024-05-06 Chenyang Ai, Lechuan Zhao, Zhijie Huang, Cangyuan Li, Xinan Wang, Ying Wang

Akceleracja obliczen algebry liniowej z wykorzystaniem masywnie rownoleglych, wielordzeniowych procesorow GPU (Acceleration of linear algebra computations using massively parallel, multi-core GPU processors)

The paper presents the aspect of use of modern graphics accelerators supporting CUDA technology for high-performance computing in the field of linear algebra. Fully programmable graphic cards have been available for several years for both…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-06-27 Lukasz Swierczewski