
Entropy-Based Block Pruning for Efficient Large Language Models

Computation and Language 2025-04-08 v1 Artificial Intelligence

Abstract

As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
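To make the contrast between the two criteria concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state how entropy is estimated or how blocks are scored, so everything here is an assumption for illustration: the fixed-width histogram estimator, the `bins` parameter, the "entropy gain" score (output entropy minus input entropy of a block), and the function names `histogram_entropy`, `cosine_similarity`, and `block_scores`. Hidden states are represented as flat lists of floats, one list per block boundary.

```python
import math


def histogram_entropy(values, bins=32):
    """Shannon entropy (in bits) of a fixed-width histogram over `values`.

    Assumption: a simple histogram estimator stands in for whatever
    entropy estimator the paper actually uses.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: no uncertainty
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def block_scores(hidden_states, bins=32):
    """Score each block from the activations at its input and output.

    hidden_states[i] is the (flattened) input to block i and
    hidden_states[i + 1] is its output.
    """
    scores = []
    for i in range(len(hidden_states) - 1):
        h_in, h_out = hidden_states[i], hidden_states[i + 1]
        scores.append({
            "block": i,
            # Hypothetical entropy criterion: change in estimated entropy.
            "entropy_gain": histogram_entropy(h_out, bins)
                            - histogram_entropy(h_in, bins),
            # Geometric baseline: how similar output is to input.
            "cosine": cosine_similarity(h_in, h_out),
        })
    return scores
```

Under these assumptions, a block with a small `entropy_gain` adds little estimated information and would be an early pruning candidate, whereas a cosine-based method would instead flag blocks whose output stays close in direction to their input. In practice the per-block activations would come from a calibration set run through the model.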

Cite

@article{arxiv.2504.03794,
  title  = {Entropy-Based Block Pruning for Efficient Large Language Models},
  author = {Liangwei Yang and Yuhui Xu and Juntao Tan and Doyen Sahoo and Silvio Savarese and Caiming Xiong and Huan Wang and Shelby Heinecke},
  journal= {arXiv preprint arXiv:2504.03794},
  year   = {2025}
}

Comments

9 pages, 8 figures
