Related papers: Can pruning make Large Language Models more effici…

Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability

The exponential growth of large language models (LLMs) like ChatGPT has revolutionized artificial intelligence, offering unprecedented capabilities in natural language processing. However, the extensive computational resources required for…

Computation and Language · Computer Science 2025-02-25 Ashhadul Islam, Samir Brahim Belhaouari, Amine Bermak

A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting

The current landscape in time-series forecasting is dominated by Transformer-based models. Their high parameter count and corresponding demand in computational resources pose a challenge to real-world deployment, especially for commercial…

Machine Learning · Computer Science 2024-12-18 Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus

Large Language Model Pruning

We surely enjoy the larger the better models for their superior performance in the last couple of years when both the hardware and software support the birth of such extremely huge models. The applied fields include text mining and others.…

Computation and Language · Computer Science 2024-06-04 Hanjuan Huang, Hao-Jia Song, Hsing-Kuo Pao

Structured Pruning of Large Language Models

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly,…

Computation and Language · Computer Science 2021-03-30 Ziheng Wang, Jeremy Wohlwend, Tao Lei

Pruning General Large Language Models into Customized Expert Models

Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial…

Computation and Language · Computer Science 2025-06-04 Yirao Zhao, Guizhen Chen, Kenji Kawaguchi, Lidong Bing, Wenxuan Zhang

Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling

While Transformer-based models have shown impressive language modeling performance, the large computation cost is often prohibitive for practical use. Attention head pruning, which removes unnecessary attention heads in the multihead…

Computation and Language · Computer Science 2021-10-08 Kyuhong Shim, Iksoo Choi, Wonyong Sung, Jungwook Choi

The LLM Surgeon

State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to…

Machine Learning · Computer Science 2024-03-22 Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

Neural Language Model Pruning for Automatic Speech Recognition

We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their…

Machine Learning · Computer Science 2023-10-06 Leonardo Emili, Thiago Fraga-Silva, Ernest Pusateri, Markus Nußbaum-Thom, Youssef Oualil

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Deep learning models have achieved tremendous success in most of the industries in recent years. The evolution of these models has also led to an increase in the model size and energy requirement, making it difficult to deploy in production…

Machine Learning · Computer Science 2024-07-24 Aayush Saxena, Arit Kumar Bishwas, Ayush Ashok Mishra, Ryan Armstrong

To prune, or not to prune: exploring the efficacy of pruning for model compression

Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks…

Machine Learning · Statistics 2017-11-15 Michael Zhu, Suyog Gupta

Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

Many efforts have been made to facilitate natural language processing tasks with pre-trained language models (LMs), and brought significant improvements to various applications. To fully leverage the nearly unlimited corpora and capture…

Computation and Language · Computer Science 2018-09-11 Liyuan Liu, Xiang Ren, Jingbo Shang, Jian Peng, Jiawei Han

Pruning Large Language Models via Accuracy Predictor

Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks. However, substantial model size poses challenges to training, inference, and deployment so…

Artificial Intelligence · Computer Science 2023-10-11 Yupeng Ji, Yibo Cao, Jiucai Liu

Numerical Pruning for Efficient Autoregressive Models

Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing. However, their impressive performance often incurs high…

Machine Learning · Computer Science 2024-12-18 Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Large language models are ubiquitous in natural language processing because they can adapt to new tasks without retraining. However, their sheer scale and complexity present unique challenges and opportunities, prompting researchers and…

Computation and Language · Computer Science 2024-08-07 Leo Donisch, Sigurd Schacht, Carsten Lanquillon

Frustratingly Easy Task-aware Pruning for Large Language Models

Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often…

Computation and Language · Computer Science 2025-10-28 Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the…

Computation and Language · Computer Science 2025-04-18 Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to…

Machine Learning · Computer Science 2026-04-07 Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev

Model Compression and Efficient Inference for Large Language Models: A Survey

Transformer based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during the inference process make it challenging to deploy large models on resource-constrained…

Computation and Language · Computer Science 2024-02-16 Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He

Compression of Neural Machine Translation Models via Pruning

Neural Machine Translation (NMT), like many other deep learning domains, typically suffers from over-parameterization, resulting in large storage sizes. This paper examines three simple magnitude-based pruning schemes to compress NMT…

Artificial Intelligence · Computer Science 2016-07-01 Abigail See, Minh-Thang Luong, Christopher D. Manning

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

The internal structure and operation mechanism of large-scale language models are analyzed theoretically, especially how Transformer and its derivative architectures can restrict computing efficiency while capturing long-term dependencies.…

Machine Learning · Computer Science 2024-05-21 Taiyuan Mei, Yun Zi, Xiaohan Cheng, Zijun Gao, Qi Wang, Haowei Yang