Related papers: Efficient Transformer-based Large Scale Language R…

Prune Once for All: Sparse Pre-Trained Language Models

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase…

Computation and Language · Computer Science 2021-11-11 Ofir Zafrir , Ariel Larey , Guy Boudoukh , Haihao Shen , Moshe Wasserblat

Structured Pruning of a BERT-based Question Answering Model

The recent trend in industry-setting Natural Language Processing (NLP) research has been to operate large %scale pretrained language models like BERT under strict computational limits. While most model compression work has focused on…

Computation and Language · Computer Science 2021-04-13 J. S. McCarley , Rishav Chakravarti , Avirup Sil

Reweighted Proximal Pruning for Large-Scale Language Representation

Recently, pre-trained language representation flourishes as the mainstay of the natural language understanding community, e.g., BERT. These pre-trained language representations can create state-of-the-art results on a wide range of…

Machine Learning · Computer Science 2019-12-24 Fu-Ming Guo , Sijia Liu , Finlay S. Mungall , Xue Lin , Yanzhi Wang

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Self-supervised speech representation learning (SSL) has shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation…

Computation and Language · Computer Science 2023-03-01 Yifan Peng , Kwangyoun Kim , Felix Wu , Prashant Sridhar , Shinji Watanabe

Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos…

Machine Learning · Computer Science 2025-04-09 Ivan Ilin , Peter Richtarik

Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial…

Computation and Language · Computer Science 2026-01-07 Guangxin Wu , Hao Zhang , Zhang Zhibin , Jiafeng Guo , Xueqi Cheng

Structured Pruning of Large Language Models

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly,…

Computation and Language · Computer Science 2021-03-30 Ziheng Wang , Jeremy Wohlwend , Tao Lei

The LLM Surgeon

State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to…

Machine Learning · Computer Science 2024-03-22 Tycho F. A. van der Ouderaa , Markus Nagel , Mart van Baalen , Yuki M. Asano , Tijmen Blankevoort

FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing

The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We…

Computation and Language · Computer Science 2025-02-03 James Seale Smith , Chi-Heng Lin , Shikhar Tuli , Haris Jeelani , Shangqian Gao , Yilin Shen , Hongxia Jin , Yen-Chang Hsu

Gradient-Free Structured Pruning with Unlabeled Data

Large Language Models (LLMs) have achieved great success in solving difficult tasks across many domains, but such success comes with a high computation cost, and inference latency. As developers and third parties customize these models, the…

Machine Learning · Computer Science 2023-07-18 Azade Nova , Hanjun Dai , Dale Schuurmans

Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training

Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train…

Machine Learning · Computer Science 2025-11-17 Rui Pan , Shivanshu Shekhar , Boyao Wang , Shizhe Diao , Jipeng Zhang , Xingyuan Pan , Renjie Pi , Tong Zhang

Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

Many efforts have been made to facilitate natural language processing tasks with pre-trained language models (LMs), and brought significant improvements to various applications. To fully leverage the nearly unlimited corpora and capture…

Computation and Language · Computer Science 2018-09-11 Liyuan Liu , Xiang Ren , Jingbo Shang , Jian Peng , Jiawei Han

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models

Despite the remarkable success of Large Language Models (LLMs), the massive size poses significant deployment challenges, particularly on resource-constrained hardware. While existing LLM compression methods focus on quantization, pruning…

Artificial Intelligence · Computer Science 2023-10-12 Song Guo , Jiahang Xu , Li Lyna Zhang , Mao Yang

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of…

Computation and Language · Computer Science 2022-10-19 Eldar Kurtic , Daniel Campos , Tuan Nguyen , Elias Frantar , Mark Kurtz , Benjamin Fineran , Michael Goin , Dan Alistarh

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at…

Computation and Language · Computer Science 2024-12-16 Jiwon Song , Kyungseok Oh , Taesu Kim , Hyungjun Kim , Yulhwa Kim , Jae-Joon Kim

LLM-Pruner: On the Structural Pruning of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in both the…

Computation and Language · Computer Science 2023-09-29 Xinyin Ma , Gongfan Fang , Xinchao Wang

MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures

The remarkable performance of large language models (LLMs) in various language tasks has attracted considerable attention. However, the ever-increasing size of these models presents growing challenges for deployment and inference.…

Computation and Language · Computer Science 2025-02-21 Jiayu Qin , Jianchao Tan , Kefeng Zhang , Xunliang Cai , Wei Wang

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Automatic Speech Recognition (ASR) has seen remarkable advancements with deep neural networks, such as Transformer and Conformer. However, these models typically have large model sizes and high inference costs, posing a challenge to deploy…

Computation and Language · Computer Science 2023-06-01 Huiqiang Jiang , Li Lyna Zhang , Yuang Li , Yu Wu , Shijie Cao , Ting Cao , Yuqing Yang , Jinyu Li , Mao Yang , Lili Qiu

Can pruning make Large Language Models more efficient?

Transformer models have revolutionized natural language processing with their unparalleled ability to grasp complex contextual relationships. However, the vast number of parameters in these models has raised concerns regarding computational…

Machine Learning · Computer Science 2023-10-10 Sia Gholami , Marwan Omar

Entropy-Based Block Pruning for Efficient Large Language Models

As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an…

Computation and Language · Computer Science 2025-04-08 Liangwei Yang , Yuhui Xu , Juntao Tan , Doyen Sahoo , Silvio Savarese , Caiming Xiong , Huan Wang , Shelby Heinecke