Related papers: Indirection Stream Semantic Register Architecture …

Sparse Stream Semantic Registers: A Lightweight ISA Extension Accelerating General Sparse Linear Algebra

Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an…

Hardware Architecture · Computer Science 2023-10-03 Paul Scheffler , Florian Zaruba , Fabian Schuiki , Torsten Hoefler , Luca Benini

IndexMAC: A Custom RISC-V Vector Instruction to Accelerate Structured-Sparse Matrix Multiplications

Structured sparsity has been proposed as an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. The acceleration of ML models - for both training and…

Hardware Architecture · Computer Science 2023-11-14 V. Titopoulos , K. Alexandridis , C. Peltekis , C. Nicopoulos , G. Dimitrakopoulos

Stream Semantic Registers: A Lightweight RISC-V ISA Extension Achieving Full Compute Utilization in Single-Issue Cores

Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/storse necessary to feed their ALU/FPU. Each instruction spent on moving data is a…

Hardware Architecture · Computer Science 2020-04-02 Fabian Schuiki , Florian Zaruba , Torsten Hoefler , Luca Benini

Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Sparse matrix vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures.…

Hardware Architecture · Computer Science 2023-11-20 Chi Zhang , Paul Scheffler , Thomas Benz , Matteo Perotti , Luca Benini

Sparse GPU Kernels for Deep Learning

Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because…

Machine Learning · Computer Science 2020-09-02 Trevor Gale , Matei Zaharia , Cliff Young , Erich Elsen

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Sparse linear algebra kernels play a critical role in numerous applications, covering from exascale scientific simulation to large-scale data analytics. Offloading linear algebra kernels on one GPU will no longer be viable in these…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-19 Jieyang Chen , Chenhao Xie , Jesun S Firoz , Jiajia Li , Shuaiwen Leon Song , Kevin Barker , Mark Raugas , Ang Li

Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries

We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as in porous…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-14 Philipp Suffa , Markus Holzer , Harald Köstler , Ulrich Rüde

SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers

Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for…

Mathematical Software · Computer Science 2024-04-09 Paul Scheffler , Luca Colagrande , Luca Benini

SpikeStream: Accelerating Spiking Neural Network Inference on RISC-V Clusters with Sparse Computation Extensions

Spiking Neural Network (SNN) inference has a clear potential for high energy efficiency as computation is triggered by events. However, the inherent sparsity of events poses challenges for conventional computing systems, driving the…

Hardware Architecture · Computer Science 2025-04-09 Simone Manoni , Paul Scheffler , Luca Zanatta , Andrea Acquaviva , Luca Benini , Andrea Bartolini

Exact Sparse Matrix-Vector Multiplication on GPU's and Multicore Architectures

We propose different implementations of the sparse matrix--dense vector multiplication (\spmv{}) for finite fields and rings $\Zb/m\Zb$. We take advantage of graphic card processors (GPU) and multi-core architectures. Our aim is to improve…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-09-09 Brice Boyer , Jean-Guillaume Dumas , Pascal Giorgi

DBCSR: A Library for Dense Matrix Multiplications on Distributed GPU-Accelerated Systems

Most, if not all the modern scientific simulation packages utilize matrix algebra operations. Among the operation of the linear algebra, one of the most important kernels is the multiplication of matrices, dense and sparse. Examples of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-14 Ilia Sivkov , Alfio Lazzaro , Juerg Hutter

Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-17 Haisha Zhao , San Li , Jiaheng Wang , Chunbao Zhou , Jue Wang , Zhikuang Xin , Shunde Li , Zhiqiang Liang , Zhijie Pan , Fang Liu , Yan Zeng , Yangang Wang , Xuebin Chi

dCSR: A Memory-Efficient Sparse Matrix Representation for Parallel Neural Network Inference

Reducing the memory footprint of neural networks is a crucial prerequisite for deploying them in small and low-cost embedded devices. Network parameters can often be reduced significantly through pruning. We discuss how to best represent…

Data Structures and Algorithms · Computer Science 2021-11-25 Elias Trommer , Bernd Waschneck , Akash Kumar

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

In recent years, novel AI accelerators have emerged as promising alternatives to GPU for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-01 Milan Shah , Sheng Di , Michela Becchi

Insum: Sparse GPU Kernels Simplified and Optimized with Indirect Einsums

Programming high-performance sparse GPU kernels is notoriously difficult, requiring both substantial effort and deep expertise. Sparse compilers aim to simplify this process, but existing systems fall short in two key ways. First, they are…

Programming Languages · Computer Science 2025-10-21 Jaeyeon Won , Willow Ahrens , Joel S. Emer , Saman Amarasinghe

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-13 Carl Yang , Aydin Buluc , John D. Owens

Accelerating Sparse Deep Neural Networks

As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero…

Machine Learning · Computer Science 2021-04-20 Asit Mishra , Jorge Albericio Latorre , Jeff Pool , Darko Stosic , Dusan Stosic , Ganesh Venkatesh , Chong Yu , Paulius Micikevicius

A Low-Power Sparse Deep Learning Accelerator with Optimized Data Reuse

Sparse deep learning has reduced computation significantly, but its irregular non-zero data distribution complicates the data flow and hinders data reuse, increasing on-chip SRAM access and thus power consumption of the chip. This paper…

Hardware Architecture · Computer Science 2025-03-26 Kai-Chieh Hsu , Tian-Sheuan Chang

Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms

Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive, and exhibit ample amounts of fine-grain ordered…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-16 Jian Weng , Vidushi Dadu , Tony Nowatzki

Sparse Matrix Multiplication on CAM Based Accelerator

Sparse matrix multiplication is an important component of linear algebra computations. In this paper, an architecture based on Content Addressable Memory (CAM) and Resistive Content Addressable Memory (ReCAM) is proposed for accelerating…

Hardware Architecture · Computer Science 2017-05-30 Leonid Yavits , Ran Ginosar