Related papers: AsyncSparse: Accelerating Sparse Matrix-Matrix Mul…

Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-17 Haisha Zhao , San Li , Jiaheng Wang , Chunbao Zhou , Jue Wang , Zhikuang Xin , Shunde Li , Zhiqiang Liang , Zhijie Pan , Fang Liu , Yan Zeng , Yangang Wang , Xuebin Chi

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental computation in graph analytics, scientific simulation, and sparse deep learning workloads. However, the extreme irregularity of real-world sparse matrices prevents existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Aiying Li , Jingwei Sun , Han Li , Wence Ji , Guangzhong Sun

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-13 Carl Yang , Aydin Buluc , John D. Owens

cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores

Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix…

Performance · Computer Science 2025-11-25 Lizhi Xiang , Omid Asudeh , Gerald Sabin , Aravind Sukumaran-Rajam , P. Sadayappan

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Sparse matrix-vector and matrix-matrix multiplication (SpMV and SpMM) are fundamental in both conventional (graph analytics, scientific computing) and emerging (sparse DNN, GNN) domains. Workload-balancing and parallel-reduction are…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-15 Guyue Huang , Guohao Dai , Yu Wang , Yufei Ding , Yuan Xie

GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks

Graph Neural Networks (GNNs) have achieved significant improvements in various domains. Sparse Matrix-Matrix multiplication (SpMM) is a fundamental operator in GNNs, which performs a multiplication between a sparse matrix and a dense…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-08 Guyue Huang , Guohao Dai , Yu Wang , Huazhong Yang

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM…

Mathematical Software · Computer Science 2020-10-01 Orestis Zachariadis , Nitin Satpute , Juan Gómez-Luna , Joaquín Olivares

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-17 Jinliang Shi , Shigang Li , Youxuan Xu , Rongtian Fu , Xueying Wang , Tong Wu

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs…

Programming Languages · Computer Science 2025-06-19 Hossein Albakri , Kazem Cheshmi

High Performance Unstructured SpMM Computation Using Tensor Cores

High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-22 Patrik Okanovic , Grzegorz Kwasniewski , Paolo Sylos Labini , Maciej Besta , Flavio Vella , Torsten Hoefler

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Fueled by the ability to mine real-world graph data, GNN applications have experienced phenomenal growth. Sparse Matrix-Matrix Multiplication (SpMM) is a critical operator in GNNs. However, existing SpMM designs for GNNs struggle to adapt…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-18 Lixing Zhang , Guanhua Ye , Hongzheng Li , Shigang Li , Yingxia Shao

AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices

Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-22 Zhen Du , Jiajia Li , Yinshan Wang , Xueqi Li , Guangming Tan , Ninghui Sun

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Sparse linear algebra kernels play a critical role in numerous applications, covering from exascale scientific simulation to large-scale data analytics. Offloading linear algebra kernels on one GPU will no longer be viable in these…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-19 Jieyang Chen , Chenhao Xie , Jesun S Firoz , Jiajia Li , Shuaiwen Leon Song , Kevin Barker , Mark Raugas , Ang Li

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors together with structural diversity among different sparse matrices lead to…

Performance · Computer Science 2017-11-16 Athena Elafrou , Georgios Goumas , Nektarios Koziris

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although the matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for matrix multiplication of the near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the…

Performance · Computer Science 2022-10-25 Xiaoyan Liu , Yi Liu , Ming Dun , Bohong Yin , Hailong Yang , Zhongzhi Luan , Depei Qian

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication

Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-09 Matthew Qian , Yahia Ramadan , Suhita Anubha , Ariful Azad

HC-SpMM: Accelerating Sparse Matrix-Matrix Multiplication for Graphs with Hybrid GPU Cores

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in graph computing and analytics. However, the irregularity of real-world graphs poses significant challenges to achieving efficient SpMM operation for graph data on…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-13 Zhonggen Li , Xiangyu Ke , Yifan Zhu , Yunjun Gao , Yaofeng Tu

CB-SpMV:A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs

Sparse matrix-vector multiplication (SpMV) is crucial in computational science, engineering, and machine learning. Despite substantial efforts to improve SpMV performance on GPUs through various techniques, issues related to data locality,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-19 Xing Cong , Fukai Sun , Yifan Chen , Chenhao Xie* , Yi Liu , Depei Qian

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

In recent years, novel AI accelerators have emerged as promising alternatives to GPU for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-01 Milan Shah , Sheng Di , Michela Becchi

Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-10 Mehmet Deveci , Christian Trott , Sivasankaran Rajamanickam