Related papers: An Algorithm-Hardware Co-Optimized Framework for A…

FPGA Co-Design for Efficient N:M Sparse and Quantized Model Inference

Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes…

Machine Learning · Computer Science · 2026-01-21 · Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu

Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that…

Machine Learning · Computer Science · 2025-06-04 · Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen

NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU

Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-05 · Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan

Accelerating Sparse Transformer Inference on GPU

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask…

Machine Learning · Computer Science · 2026-05-29 · Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-01 · Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications twice as fast as a dense equivalent by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training…

Machine Learning · Computer Science · 2024-12-30 · Yuezhou Hu, Jun Zhu, Jianfei Chen

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

State-of-the-art Transformer-based models, with gigantic parameters, are difficult to be accommodated on resource constrained embedded devices. Moreover, with the development of technology, more and more embedded devices are available to…

Machine Learning · Computer Science · 2021-10-20 · Panjie Qi, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Hongwu Peng, Shaoyi Huang, Zhenglun Kong, Yuhong Song, Bingbing Li

A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Transformers are considered one of the most important deep learning models since 2018, in part because it establishes state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable…

Machine Learning · Computer Science · 2022-08-23 · Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, Ang Li, Weiwen Jiang, Wujie Wen, Jinbo Bi, Hang Liu, Caiwen Ding

Accelerating Transformer Pre-training with 2:4 Sparsity

Training large transformers is slow, but recent innovations on GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In the light of this…

Machine Learning · Computer Science · 2024-10-29 · Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks

Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware…

Machine Learning · Computer Science · 2025-06-02 · Xiang Meng, Mehdi Makni, Rahul Mazumder

Balanced Sparsity for Efficient DNN Inference on GPU

In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires the customization of hardwares to speed up practical inference. Another trend accelerates sparse model inference…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-30 · Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie

Accelerating Sparse DNNs Based on Tiled GEMM

Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly-distributed weights to maintain accuracy, leading to irregular computations. Consequently, unstructured…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-02-19 · Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo

Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs

N:M sparsity pruning is a powerful technique for compressing deep neural networks, utilizing NVIDIA's Sparse Tensor Core technology. This method benefits from hardware support for sparse indexing, enabling the adoption of fine-grained…

Machine Learning · Computer Science · 2024-07-31 · Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin

Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern…

Machine Learning · Computer Science · 2022-09-19 · Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna

To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training

Trainings of Large Language Models are generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger…

Machine Learning · Computer Science · 2026-02-09 · Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu

Efficient Dynamic Structured Sparse Training with Learned Shuffles

Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every…

Machine Learning · Computer Science · 2025-10-17 · Abhishek Tyagi, Arjun Iyer, Liam Young, William H Renninger, Christopher Kanan, Yuhao Zhu

Accelerating Sparse Deep Neural Networks

As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero…

Machine Learning · Computer Science · 2021-04-20 · Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, Paulius Micikevicius

Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators

Exploiting sparsity in deep neural networks (DNNs) has been a promising area for meeting the growing computation requirements. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparsity support,…

Machine Learning · Computer Science · 2025-05-27 · Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna

Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design

Sparse training is one of the promising techniques to reduce the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of consecutive M elements can be nonzero, has…

Machine Learning · Computer Science · 2023-09-25 · Chao Fang, Wei Sun, Aojun Zhou, Zhongfeng Wang

E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity

Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the…

Machine Learning · Computer Science · 2024-03-25 · Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang