Related papers: An Algorithm-Hardware Co-Optimized Framework for A…

200 papers

Large language models (LLMs) have demonstrated remarkable performance across a wide range of language processing tasks. However, this success comes at the cost of substantial computation and memory requirements, which significantly impedes…

Machine Learning · Computer Science 2026-01-21 Fen-Yu Hsieh, Yun-Chang Teng, Ding-Yong Hong, Jan-Jan Wu

To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that…

Machine Learning · Computer Science 2025-06-04 Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen

Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-05 Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask…

Machine Learning · Computer Science 2026-05-29 Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-01 Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu

Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications twice as fast as a dense equivalent by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training…

Machine Learning · Computer Science 2024-12-30 Yuezhou Hu, Jun Zhu, Jianfei Chen

State-of-the-art Transformer-based models, with gigantic parameters, are difficult to be accommodated on resource constrained embedded devices. Moreover, with the development of technology, more and more embedded devices are available to…

Machine Learning · Computer Science 2021-10-20 Panjie Qi, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Hongwu Peng, Shaoyi Huang, Zhenglun Kong, Yuhong Song, Bingbing Li

Transformers are considered one of the most important deep learning models since 2018, in part because it establishes state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable…

Machine Learning · Computer Science 2022-08-23 Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, Ang Li, Weiwen Jiang, Wujie Wen, Jinbo Bi, Hang Liu, Caiwen Ding

Training large transformers is slow, but recent innovations on GPU architecture give us an advantage. NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent. In the light of this…

Machine Learning · Computer Science 2024-10-29 Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware…

Machine Learning · Computer Science 2025-06-02 Xiang Meng, Mehdi Makni, Rahul Mazumder

In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires the customization of hardwares to speed up practical inference. Another trend accelerates sparse model inference…

Computer Vision and Pattern Recognition · Computer Science 2020-10-30 Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, Lanshun Nie

Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly-distributed weights to maintain accuracy, leading to irregular computations. Consequently, unstructured…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-19 Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo

N:M sparsity pruning is a powerful technique for compressing deep neural networks, utilizing NVIDIA's Sparse Tensor Core technology. This method benefits from hardware support for sparse indexing, enabling the adoption of fine-grained…

Machine Learning · Computer Science 2024-07-31 Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin

Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern…

Machine Learning · Computer Science 2022-09-19 Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna

Trainings of Large Language Models are generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger…

Machine Learning · Computer Science 2026-02-09 Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu

Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every…

Machine Learning · Computer Science 2025-10-17 Abhishek Tyagi, Arjun Iyer, Liam Young, William H Renninger, Christopher Kanan, Yuhao Zhu

As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity - encouraging zero…

Exploiting sparsity in deep neural networks (DNNs) has been a promising area for meeting the growing computation requirements. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparsity support,…

Machine Learning · Computer Science 2025-05-27 Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna

Sparse training is one of the promising techniques to reduce the computational cost of DNNs while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of consecutive M elements can be nonzero, has…

Machine Learning · Computer Science 2023-09-25 Chao Fang, Wei Sun, Aojun Zhou, Zhongfeng Wang

Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the…

Machine Learning · Computer Science 2024-03-25 Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang