Push Puppet Networks: Structured Bayesian Pruning Algorithm for Language Model Compression

Applied Statistics 2026-06-26 v1

Abstract

This paper presents push puppet networks, a novel Bayesian algorithm for structured pruning of large language models. During training, a push puppet network learns a hierarchical function that can optimally determine which network layers to keep for a given target size. The method adds a small number of gating parameters through a hierarchical penalty function; the resulting smooth function allows a network to be resized to very specific sizes without loading the full model into memory or requiring further post-training computation. The method compares favorably with existing approaches (SparseGPT, Wanda) at high pruning levels (less than 50% of network structure) while realizing measurable speedups on conventional GPUs with PyTorch. Furthermore, push puppet networks can achieve significant speedups when used as candidates for speculative decoding.
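
To make the idea of resizing from learned gates concrete, the following is a minimal sketch and not the paper's algorithm. It assumes one gate per layer, formed as a cumulative product of sigmoids so that the gates are strictly decreasing along a learned priority order, and it assumes that resizing amounts to keeping the layers with the largest gates. The names `hierarchical_gates`, `layers_to_keep`, `logits`, and `priority` are hypothetical and introduced only for illustration.

```python
import math


def hierarchical_gates(logits):
    """Assumed gate form: cumulative product of sigmoids.

    Each sigmoid lies in (0, 1), so every gate is strictly smaller
    than the one before it, which yields a nested ordering.
    """
    gates = []
    running = 1.0
    for z in logits:
        running *= 1.0 / (1.0 + math.exp(-z))
        gates.append(running)
    return gates


def layers_to_keep(gates, priority, target_layers):
    """Return the layer indices kept for a target layer count.

    gates[j] belongs to layer priority[j] (hypothetical learned order).
    The layers with the largest gates are kept and returned in depth order.
    """
    if not 0 <= target_layers <= len(gates):
        raise ValueError("target_layers must be between 0 and the number of gates")
    ranked = sorted(range(len(gates)), key=lambda j: gates[j], reverse=True)
    return sorted(priority[j] for j in ranked[:target_layers])


# Illustrative values only, not taken from the paper.
logits = [3.0, 2.0, 0.5, -1.0]
priority = [0, 3, 1, 2]  # layer 0 is most important, then layer 3, ...
gates = hierarchical_gates(logits)
small = layers_to_keep(gates, priority, 2)
medium = layers_to_keep(gates, priority, 3)
```

Under these assumptions the kept sets are nested: every layer in `small` also appears in `medium`, so a single set of gates serves any target size without further computation.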

Citation

@article{arxiv.2606.28251,
  title  = {Push Puppet Networks: Structured Bayesian Pruning Algorithm for Language Model Compression},
  author = {Robert Kubinec},
  journal= {arXiv preprint arXiv:2606.28251},
  year   = {2026}
}