Related papers: Distributed-Memory Sparse Kernels for Machine Lear…

FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks

We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication and sparse-dense matrix multiplication under a single operation called FusedMM. By using user-defined functions, FusedMM can capture…

Machine Learning · Computer Science · 2021-10-28 · Md. Khaledur Rahman, Majedul Haque Sujon, Ariful Azad

SpComm3D: A Framework for Enabling Sparse Communication in 3D Sparse Kernels

Existing 3D algorithms for distributed-memory sparse kernels suffer from limited scalability due to reliance on bulk sparsity-agnostic communication. While easier to use, sparsity-agnostic communication leads to unnecessary bandwidth and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-01 · Nabil Abubaker, Torsten Hoefler

Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU

In this paper, we focus on three sparse matrix operations that are relevant for machine learning applications, namely, the sparse-dense matrix multiplication (SPMM), the sampled dense-dense matrix multiplication (SDDMM), and the composition…

Machine Learning · Computer Science · 2023-11-02 · Mohammad Zubair, Christoph Bauinger

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-08-23 · Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad

SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in substantial communication overhead,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-14 · Chen Zhuang, Lingqi Zhang, Benjamin Brock, Du Wu, Peng Chen, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib

A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication

Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-08-28 · Yuxi Hong, Aydin Buluc

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-12-17 · Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in various graph, scientific computing and machine learning algorithms. In this paper, we consider SpGEMMs performed on hundreds of thousands of processors generating…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-10-19 · Md Taufique Hussain, Oguz Selvitopi, Aydin Buluç, Ariful Azad

Sparsity-Aware Communication for Distributed Graph Neural Network Training

Graph Neural Networks (GNNs) are a computationally efficient method to learn embeddings and classifications on graph data. However, GNN training has low computational intensity, making communication costs the bottleneck for scalability.…

Machine Learning · Computer Science · 2025-04-08 · Ujjaini Mukhodopadhyay, Alok Tripathy, Oguz Selvitopi, Katherine Yelick, Aydin Buluc

Fused3S: Fast Sparse Attention on Tensor Cores

Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-14 · Zitong Li, Aparna Chandramowlishwaran

Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization

Adaptive moment estimation (Adam), as a Stochastic Gradient Descent (SGD) variant, has gained widespread popularity in federated learning (FL) due to its fast convergence. However, federated Adam (FedAdam) algorithms suffer from a threefold…

Machine Learning · Computer Science · 2025-09-22 · Xiumei Deng, Jun Li, Kang Wei, Long Shi, Zehui Xiong, Ming Ding, Wen Chen, Shi Jin, H. Vincent Poor

Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs

Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-11-15 · Da Zheng, Disa Mhembere, Vince Lyzinski, Joshua Vogelstein, Carey E. Priebe, Randal Burns

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Sparse matrix-vector and matrix-matrix multiplication (SpMV and SpMM) are fundamental in both conventional (graph analytics, scientific computing) and emerging (sparse DNN, GNN) domains. Workload-balancing and parallel-reduction are…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-10-15 · Guyue Huang, Guohao Dai, Yu Wang, Yufei Ding, Yuan Xie

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

In recent years, novel AI accelerators have emerged as promising alternatives to GPU for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-01 · Milan Shah, Sheng Di, Michela Becchi

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-06-13 · Carl Yang, Aydin Buluc, John D. Owens

Algorithms for Parallel Shared-Memory Sparse Matrix-Vector Multiplication on Unstructured Matrices

The sparse matrix-vector (SpMV) multiplication is an important computational kernel, but it is notoriously difficult to execute efficiently. This paper investigates algorithm performance for unstructured sparse matrices, which are more…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-27 · Kobe Bergmans, Karl Meerbergen, Raf Vandebril

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-06 · Benjamin Brock, Aydın Buluç, Katherine Yelick

Communication-Efficient, 2D Parallel Stochastic Gradient Descent for Distributed-Memory Optimization

Distributed-memory implementations of numerical optimization algorithms, such as stochastic gradient descent (SGD), require interprocessor communication at every iteration of the algorithm. On modern distributed-memory clusters where…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-14 · Aditya Devarakonda, Ramakrishnan Kannan

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing works overlook the performance optimization of SpDM on modern many-core…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-06-01 · Shaohuai Shi, Qiang Wang, Xiaowen Chu

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication

Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-09 · Matthew Qian, Yahia Ramadan, Suhita Anubha, Ariful Azad