Related papers: A Parallel Scan Algorithm in the Tensor Core Unit …

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…

Performance · Computer Science 2019-11-26 Abdul Dakkak, Cheng Li, Isaac Gelado, Jinjun Xiong, Wen-mei Hwu

A Computational Model for Tensor Core Units

To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…

Data Structures and Algorithms · Computer Science 2020-07-10 Rezaul Chowdhury, Francesco Silvestri, Flavio Vella

Parallel Scan on Ascend AI Accelerators

We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-05 Bartłomiej Wróblewski, Gioele Gottardo, Anastasios Zouzias

Tensor Core Units (TCUs) are hardware accelerators developed for deep neural networks, which efficiently support the multiplication of two dense $\sqrt{m}\times \sqrt{m}$ matrices, where $m$ is a given hardware parameter. In this paper, we…

Data Structures and Algorithms · Computer Science 2020-06-24 Thomas D. Ahle, Francesco Silvestri

Parallel Algorithms for Tensor Train Arithmetic

We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…

Numerical Analysis · Mathematics 2021-09-08 Hussam Al Daas, Grey Ballard, Peter Benner

Large Scale Distributed Linear Algebra With Tensor Processing Units

We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically…

Computational Physics · Physics 2022-09-14 Adam G. M. Lewis, Jackson Beall, Martin Ganahl, Markus Hauru, Shrestha Basu Mallick, Guifre Vidal

Parallel Sparse Matrix Multiplication for Linear Scaling Electronic Structure Calculations

Linear-scaling electronic-structure techniques, also called O(N) techniques, rely heavily on the multiplication of sparse matrices, where the sparsity arises from spatial cut-offs. In order to treat very large systems, the calculations must…

Materials Science · Physics 2009-10-31 D. R. Bowler, T. Miyazaki, M. J. Gillan

Parallel Index-Based Structural Graph Clustering and Its Approximation

SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share…

Databases · Computer Science 2021-04-01 Tom Tseng, Laxman Dhulipala, Julian Shun

A parallel algorithm for Gaussian elimination over finite fields

In this paper we describe a parallel Gaussian elimination algorithm for matrices with entries in a finite field. Unlike previous approaches, our algorithm subdivides a very large input matrix into smaller submatrices by subdividing both…

Rings and Algebras · Mathematics 2018-06-13 Stephen Linton, Gabriele Nebe, Alice Niemeyer, Richard Parker, Jon Thackray

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM…

Mathematical Software · Computer Science 2020-10-01 Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

Theoretically-Efficient and Practical Parallel DBSCAN

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two…

Data Structures and Algorithms · Computer Science 2021-01-29 Yiqiu Wang, Yan Gu, Julian Shun

Multiplicação de matrizes: uma comparação entre as abordagens sequencial (CPU) e paralela (GPU)

Designing problems using matrices is very important in Computer Science. Fields like graph computer, graphs theory, and machine learning use matrices very often to solve their own problems. The most often matrix operation is the…

Performance · Computer Science 2019-05-10 Andre G. C. Pacheco

Parallel Computation of Multi-Slice Clustering of Third-Order Tensors

Machine Learning approaches like clustering methods deal with massive datasets that present an increasing challenge. We devise parallel algorithms to compute the Multi-Slice Clustering (MSC) for 3rd-order tensors. The MSC method is based on…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-02 Dina Faneva Andriantsiory, Camille Coti, Joseph Ben Geloun, Mustapha Lebbah

Low-Overhead Parallelisation of LCU via Commuting Operators

The Linear Combination of Unitaries (LCU) method is a powerful scheme for the block encoding of operators but suffers from high overheads. In this work, we discuss the parallelisation of LCU and in particular the SELECT subroutine of LCU…

Quantum Physics · Physics 2024-08-22 Gregory Boyd

Parallel algorithms for computing the tensor-train decomposition

The tensor-train (TT) decomposition expresses a tensor in a data-sparse format used in molecular simulations, high-order correlation functions, and optimization. In this paper, we propose four parallelizable algorithms that compute the TT…

Numerical Analysis · Mathematics 2021-11-23 Tianyi Shi , Maximilian Ruth , Alex Townsend

Coded Computing Meets Quantum Circuit Simulation: Coded Parallel Tensor Network Contraction Algorithm

Parallel tensor network contraction algorithms have emerged as the pivotal benchmarks for assessing the classical limits of computation, exemplified by Google's demonstration of quantum supremacy through random circuit sampling. However,…

Information Theory · Computer Science 2024-05-24 Jin Lee, Sofia Gonzalez-Garcia, Zheng Zhang, Haewon Jeong

Parallel Dynamics Computation using Prefix Sum Operations

We propose a new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans). We re-investigate the well-known recursive Newton-Euler formulation of robot dynamics and show that…

Robotics · Computer Science 2016-09-16 Yajue Yang, Yuanqing Wu, Jia Pan

Category Theory for Supercomputing: The Tensor Product of Linear BSP Algorithms

We show that a particular class of parallel algorithm for linear functions can be straightforwardly generalized to a parallel algorithm of their tensor product. The central idea is to take a model of parallel algorithms -- Bulk Synchronous…

Category Theory · Mathematics 2025-10-02 Thomas Koopman, Rob H. Bisseling, Sven-Bodo Scholz

Parallel image thinning through topological operators on shared memory parallel machines

In this paper, we present a concurrent implementation of a powerful topological thinning operator. This operator is able to act directly over grayscale images without modifying their topology. We introduce an adapted parallelization…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-31 Ramzi Mahmoudi, Mohamed Akil, Petr Matas

Round Compression for Parallel Graph Algorithms in Strongly Sublinear Space

The Massive Parallel Computation (MPC) model is a theoretical framework for popular parallel and distributed platforms such as MapReduce, Hadoop, or Spark. We consider the task of computing a large matching or small vertex cover in this…

Data Structures and Algorithms · Computer Science 2018-07-24 Krzysztof Onak