Related papers: MSREP: A Fast yet Light Sparse Matrix Framework fo…

A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as the algebraic multigrid method (AMG), breadth-first search, and the shortest path problem. Compared to other sparse BLAS routines,…

Mathematical Software · Computer Science 2015-09-15 Weifeng Liu, Brian Vinter

Sparse GPU Kernels for Deep Learning

Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because…

Machine Learning · Computer Science 2020-09-02 Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-10 Mehmet Deveci, Christian Trott, Sivasankaran Rajamanickam

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel non-parametric, self-tunable,…

Numerical Analysis · Computer Science 2012-12-24 Xintian Yang, Srinivasan Parthasarathy, Ponnuswamy Sadayappan

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-21 Jie Liu, Huanzhi Pu, Zhiru Zhang

DBCSR: A Library for Dense Matrix Multiplications on Distributed GPU-Accelerated Systems

Most, if not all, modern scientific simulation packages utilize matrix algebra operations. Among linear algebra operations, one of the most important kernels is the multiplication of matrices, dense and sparse. Examples of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-14 Ilia Sivkov, Alfio Lazzaro, Juerg Hutter

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental computation in graph analytics, scientific simulation, and sparse deep learning workloads. However, the extreme irregularity of real-world sparse matrices prevents existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Aiying Li, Jingwei Sun, Han Li, Wence Ji, Guangzhong Sun

Heterogeneous Sparse Matrix-Vector Multiplication via Compressed Sparse Row Format

Sparse matrix-vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from poor performance on many devices. Due to this poor performance, SpMV normally requires special care…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-09 Phillip Allen Lane, Joshua Dennis Booth

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to…

Performance · Computer Science 2017-11-16 Athena Elafrou, Georgios Goumas, Nektarios Koziris

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-01 Milan Shah, Sheng Di, Michela Becchi

Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-03-19 Aydin Buluc, John Gilbert

Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra

Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we…

Hardware Architecture · Computer Science 2020-12-15 Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

Multiresolution kernel matrix algebra

We propose a sparse algebra for samplet-compressed kernel matrices to enable efficient scattered data analysis. We show that the compression of kernel matrices by means of samplets produces optimally sparse matrices in a certain S-format. It…

Numerical Analysis · Mathematics 2023-05-05 H. Harbrecht, M. Multerer, O. Schenk, Ch. Schwab

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs…

Programming Languages · Computer Science 2025-06-19 Hossein Albakri, Kazem Cheshmi

Synergistic CPU-FPGA Acceleration of Sparse Linear Algebra

This paper describes REAP, a software-hardware approach that enables high performance sparse linear algebra computations on a cooperative CPU-FPGA platform. REAP carefully separates the task of organizing the matrix elements from the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-30 Mohammadreza Soltaniyeh, Richard P. Martin, Santosh Nagarakatte

dCSR: A Memory-Efficient Sparse Matrix Representation for Parallel Neural Network Inference

Reducing the memory footprint of neural networks is a crucial prerequisite for deploying them in small and low-cost embedded devices. Network parameters can often be reduced significantly through pruning. We discuss how to best represent…

Data Structures and Algorithms · Computer Science 2021-11-25 Elias Trommer, Bernd Waschneck, Akash Kumar

Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an…

Hardware Architecture · Computer Science 2023-10-03 Paul Scheffler, Florian Zaruba, Fabian Schuiki, Torsten Hoefler, Luca Benini

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-13 Carl Yang, Aydin Buluc, John D. Owens

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

In computational science and data analytics, many workloads involve irregular and sparse computations that are inherently difficult to optimize for modern hardware. A key kernel is Sparse General Matrix-Matrix Multiplication (SpGEMM), which…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-22 Yifan Li, Giulia Guidi

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Sparse matrices and linear algebra are at the heart of scientific simulations. More than 70 sparse matrix storage formats have been developed over the years, targeting a wide range of hardware architectures and matrix types. Each format is…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-15 Chris Stylianou, Michele Weiland