Related papers: Earley-Driven Dynamic Pruning for Efficient Struct…

ZipLM: Inference-Aware Structured Pruning of Language Models

The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach…

Machine Learning · Computer Science 2023-10-27 Eldar Kurtic, Elias Frantar, Dan Alistarh

Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration

Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial…

Computation and Language · Computer Science 2026-01-07 Guangxin Wu, Hao Zhang, Zhang Zhibin, Jiafeng Guo, Xueqi Cheng

DarwinLM: Evolutionary Structured Pruning of Large Language Models

Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective…

Machine Learning · Computer Science 2025-03-06 Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh

Pruning Large Language Models by Identifying and Preserving Functional Networks

Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of…

Computation and Language · Computer Science 2025-08-08 Yiheng Liu, Junhao Ning, Sichen Xia, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu

Frustratingly Easy Task-aware Pruning for Large Language Models

Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often…

Computation and Language · Computer Science 2025-10-28 Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song

Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

To ensure that text generated by large language models (LLMs) is in an expected format, constrained decoding proposes to enforce strict formal language constraints during generation. However, as we show in this work, not only do such…

Machine Learning · Computer Science 2024-03-13 Luca Beurer-Kellner, Marc Fischer, Martin Vechev

Dynamic Vocabulary Pruning in Early-Exit LLMs

Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of…

Computation and Language · Computer Science 2024-10-31 Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec

FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank…

Computation and Language · Computer Science 2026-02-09 Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang

WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding

Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state…

Artificial Intelligence · Computer Science 2025-07-23 Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun

Enhancing Large Language Models through Structured Reasoning

Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical…

Computation and Language · Computer Science 2025-06-26 Yubo Dong, Hehe Fan

DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the increased memory and computational costs associated with…

Computation and Language · Computer Science 2024-11-05 Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens…

Computation and Language · Computer Science 2024-06-03 Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

SlimLLM: Accurate Structured Pruning for Large Language Models

Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often…

Machine Learning · Computer Science 2025-05-30 Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang

DLP: Dynamic Layerwise Pruning in Large Language Models

Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to…

Computation and Language · Computer Science 2025-06-04 Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang

DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the…

Machine Learning · Computer Science 2025-07-01 Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che

LLM-Pruner: On the Structural Pruning of Large Language Models

Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in both the…

Computation and Language · Computer Science 2023-09-29 Xinyin Ma, Gongfan Fang, Xinchao Wang

Towards Efficient Active Learning in NLP via Pretrained Representations

Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. When labeled documents are scarce, active learning helps save annotation efforts but requires retraining of massive…

Machine Learning · Computer Science 2024-02-27 Artem Vysogorets, Achintya Gopal

Enhancing Large Language Model Efficiency via Symbolic Compression: A Formal Approach Towards Interpretability

Large language models (LLMs) face significant token efficiency bottlenecks in code generation and logical reasoning tasks, a challenge that directly impacts inference cost and model interpretability. This paper proposes a formal framework…

Artificial Intelligence · Computer Science 2025-02-03 Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Tianhao Xu

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial…

Computation and Language · Computer Science 2026-04-17 Andrew Kiruluta

DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces…

Computation and Language · Computer Science 2026-02-02 Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong