Related papers: Efficient Contextualized Representation: Language …

Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the…

Computation and Language · Computer Science 2025-04-18 Shuzhou Yuan, Ercong Nie, Bolei Ma, Michael Färber

Large Language Model Pruning

We surely enjoy the larger the better models for their superior performance in the last couple of years when both the hardware and software support the birth of such extremely huge models. The applied fields include text mining and others.…

Computation and Language · Computer Science 2024-06-04 Hanjuan Huang, Hao-Jia Song, Hsing-Kuo Pao

Frustratingly Easy Task-aware Pruning for Large Language Models

Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often…

Computation and Language · Computer Science 2025-10-28 Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song

A Survey on Model Compression for Large Language Models

Large Language Models (LLMs) have transformed natural language processing tasks successfully. Yet, their large size and high computational needs pose challenges for practical use, especially in resource-limited settings. Model compression…

Computation and Language · Computer Science 2024-07-31 Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang

Pruning General Large Language Models into Customized Expert Models

Large language models (LLMs) have revolutionized natural language processing, yet their substantial model sizes often require substantial computational resources. To preserve computing resources and accelerate inference speed, it is crucial…

Computation and Language · Computer Science 2025-06-04 Yirao Zhao, Guizhen Chen, Kenji Kawaguchi, Lidong Bing, Wenxuan Zhang

DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the…

Machine Learning · Computer Science 2025-07-01 Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che

GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment…

Computation and Language · Computer Science 2025-06-26 Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping

Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview…

Machine Learning · Computer Science 2025-09-03 Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar

Compressing Large Language Models with Automated Sub-Network Search

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become…

Computation and Language · Computer Science 2025-02-06 Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein

Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally…

Information Retrieval · Computer Science 2025-10-28 Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining…

Machine Learning · Computer Science 2024-06-25 Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

Structured Pruning of Large Language Models

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly,…

Computation and Language · Computer Science 2021-03-30 Ziheng Wang, Jeremy Wohlwend, Tao Lei

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Deep learning models have achieved tremendous success in most of the industries in recent years. The evolution of these models has also led to an increase in the model size and energy requirement, making it difficult to deploy in production…

Machine Learning · Computer Science 2024-07-24 Aayush Saxena, Arit Kumar Bishwas, Ayush Ashok Mishra, Ryan Armstrong

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens…

Computation and Language · Computer Science 2024-06-03 Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

Large Language Models Are Overparameterized Text Encoders

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that…

Computation and Language · Computer Science 2024-10-21 Thennal D K, Tim Fischer, Chris Biemann

Can pruning make Large Language Models more efficient?

Transformer models have revolutionized natural language processing with their unparalleled ability to grasp complex contextual relationships. However, the vast number of parameters in these models has raised concerns regarding computational…

Machine Learning · Computer Science 2023-10-10 Sia Gholami, Marwan Omar

Streamlining Redundant Layers to Compress Large Language Models

This paper introduces LLM-Streamline, a pioneer work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less…

Computation and Language · Computer Science 2025-01-28 Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

Pruning Large Language Models via Accuracy Predictor

Large language models (LLMs) containing tens of billions of parameters (or even more) have demonstrated impressive capabilities in various NLP tasks. However, substantial model size poses challenges to training, inference, and deployment so…

Artificial Intelligence · Computer Science 2023-10-11 Yupeng Ji, Yibo Cao, Jiucai Liu

Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial…

Computation and Language · Computer Science 2026-01-07 Guangxin Wu, Hao Zhang, Zhang Zhibin, Jiafeng Guo, Xueqi Cheng

Pruning Large Language Models by Identifying and Preserving Functional Networks

Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of…

Computation and Language · Computer Science 2025-08-08 Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu