Related papers: A Parallel Scan Algorithm in the Tensor Core Unit …
Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as TensorCore Units (TCUs). These TCUs are capable of performing matrix multiplications on small matrices (usually 4x4 or…
To respond to the need of efficient training and inference of deep neural networks, a plethora of domain-specific hardware architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature…
We design and implement parallel prefix sum (scan) algorithms using Ascend AI accelerators. Ascend accelerators feature specialized computing units: the cube units for efficient matrix multiplication and the vector units for optimized…
Tensor Core Units (TCUs) are hardware accelerators developed for deep neural networks, which efficiently support the multiplication of two dense $\sqrt{m}\times \sqrt{m}$ matrices, where $m$ is a given hardware parameter. In this paper, we…
We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms…
We have repurposed Google Tensor Processing Units (TPUs), application-specific chips developed for machine learning, into large-scale dense linear algebra supercomputers. The TPUs' fast inter-core interconnects (ICI)s, physically…
Linear-scaling electronic-structure techniques, also called O(N) techniques, rely heavily on the multiplication of sparse matrices, where the sparsity arises from spatial cut-offs. In order to treat very large systems, the calculations must…
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share…
In this paper we describe a parallel Gaussian elimination algorithm for matrices with entries in a finite field. Unlike previous approaches, our algorithm subdivides a very large input matrix into smaller submatrices by subdividing both…
Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM…
The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two…
Designing problems using matrices is very important in Computer Science. Fields like graph computer, graphs theory, and machine learning use matrices very often to solve their own problems. The most often matrix operation is the…
Machine Learning approaches like clustering methods deal with massive datasets that present an increasing challenge. We devise parallel algorithms to compute the Multi-Slice Clustering (MSC) for 3rd-order tensors. The MSC method is based on…
The Linear Combination of Unitaries (LCU) method is a powerful scheme for the block encoding of operators but suffers from high overheads. In this work, we discuss the parallelisation of LCU and in particular the SELECT subroutine of LCU…
The tensor-train (TT) decomposition expresses a tensor in a data-sparse format used in molecular simulations, high-order correlation functions, and optimization. In this paper, we propose four parallelizable algorithms that compute the TT…
Parallel tensor network contraction algorithms have emerged as the pivotal benchmarks for assessing the classical limits of computation, exemplified by Google's demonstration of quantum supremacy through random circuit sampling. However,…
We propose a new parallel framework for fast computation of inverse and forward dynamics of articulated robots based on prefix sums (scans). We re-investigate the well-known recursive Newton-Euler formulation of robot dynamics and show that…
We show that a particular class of parallel algorithm for linear functions can be straightforwardly generalized to a parallel algorithm of their tensor product. The central idea is to take a model of parallel algorithms -- Bulk Synchronous…
In this paper, we present a concurrent implementation of a powerful topological thinning operator. This operator is able to act directly over grayscale images without modifying their topology. We introduce an adapted parallelization…
The Massive Parallel Computation (MPC) model is a theoretical framework for popular parallel and distributed platforms such as MapReduce, Hadoop, or Spark. We consider the task of computing a large matching or small vertex cover in this…