Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

Machine Learning 2025-04-09 v1 Artificial Intelligence Computation and Language Performance

Abstract

This paper presents Thanos, a novel weight-pruning algorithm designed to reduce the memory footprint and enhance the computational efficiency of large language models (LLMs) by removing redundant weights while maintaining accuracy. Thanos introduces a block-wise pruning strategy with adaptive masks that dynamically adjust to weight importance, enabling flexible sparsity patterns and structured formats, such as n:m sparsity, optimized for hardware acceleration. Experimental evaluations demonstrate that Thanos achieves state-of-the-art performance in structured pruning and outperforms existing methods in unstructured pruning. By providing an efficient and adaptable approach to model compression, Thanos offers a practical solution for deploying large models in resource-constrained environments.
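
To make the n:m sparsity format mentioned above concrete, the following is a minimal sketch of how an n:m mask can be built for a single weight row. It assumes the convention that n of every m consecutive weights are set to zero, and it ranks weights by plain absolute value. That magnitude criterion is a stand-in chosen for illustration only; it is not the adaptive importance measure or the block-wise update used by Thanos, and the function names `nm_magnitude_mask` and `apply_mask` are hypothetical.

```python
from typing import List


def nm_magnitude_mask(weights: List[float], n: int, m: int) -> List[int]:
    """Return a 0/1 keep-mask with exactly n zeros in every group of m weights.

    Assumption: importance is approximated by absolute value. This is a
    simple magnitude rule, not the criterion used by Thanos.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("row length must be a multiple of m")

    mask = [1] * len(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n smallest-magnitude weights in this group.
        to_prune = sorted(group, key=lambda i: abs(weights[i]))[:n]
        for i in to_prune:
            mask[i] = 0
    return mask


def apply_mask(weights: List[float], mask: List[int]) -> List[float]:
    """Zero out the weights whose mask entry is 0."""
    return [w * k for w, k in zip(weights, mask)]


# Hypothetical example row, pruned to 2:4 sparsity (2 zeros per group of 4).
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
mask = nm_magnitude_mask(row, n=2, m=4)
pruned = apply_mask(row, mask)
```

Because every group of m weights has the same number of zeros, the nonzero values can be stored compactly alongside small per-group indices, which is what makes this pattern attractive for hardware acceleration.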

Cite

@article{arxiv.2504.05346,
  title  = {Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression},
  author = {Ivan Ilin and Peter Richtarik},
  journal= {arXiv preprint arXiv:2504.05346},
  year   = {2025}
}

Comments

8 pages, 3 Figures, 3 Tables, 2 Algorithms, paper comes with Appendix
