Push Puppet Networks: Structured Bayesian Pruning Algorithm for Language Model Compression
Abstract
This paper presents push puppet networks, a novel Bayesian algorithm for structured pruning of large language models. During training, a push puppet network learns a hierarchical function that can optimally determine which network layers to keep for a given target size. The method adds a small number of gating parameters through a hierarchical penalty function, and the resulting smooth function allows a network to be resized to very specific sizes without loading the full model into memory and without any further post-training computation. The method compares favorably with existing approaches (SparseGPT, Wanda) at high pruning levels (less than 50% of the network structure) while realizing measurable speed-ups on conventional GPUs with PyTorch. Furthermore, push puppet networks can achieve significant speed-ups when used as candidates for speculative decoding.
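The abstract does not spell out the gating mechanism, so the following is only a minimal sketch of the general idea of gate-based layer selection, not the paper's algorithm. It assumes one unconstrained gate parameter per layer, a sigmoid to map each gate into (0, 1), and a flat penalty on the gate sum standing in for the hierarchical penalty, whose form is not given here. The names `gate_probabilities`, `gate_penalty`, and `layers_to_keep` are hypothetical.

```python
import math


def gate_probabilities(logits):
    """Map unconstrained per-layer gate parameters to (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def gate_penalty(logits, strength=0.01):
    """Flat penalty on the gate sum.

    Assumption: this is a stand-in for the hierarchical penalty mentioned
    in the abstract, whose exact form is not specified there.
    """
    return strength * sum(gate_probabilities(logits))


def layers_to_keep(logits, target_fraction):
    """Return the sorted indices of layers retained at a target size.

    Layers are ranked by gate value and the highest-ranked ones are kept,
    so the retained count never exceeds target_fraction of the layers
    (with a floor of one layer).
    """
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    n_keep = max(1, math.floor(target_fraction * len(logits)))
    probs = gate_probabilities(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical gate values for a six-layer model, resized to half depth.
example_logits = [2.0, -1.0, 0.5, 3.0, -2.0, 0.0]
kept = layers_to_keep(example_logits, target_fraction=0.5)
```

Because the ranking depends only on the learned gates, choosing a different `target_fraction` in this sketch requires no retraining, which mirrors the abstract's claim that resizing needs no further post-training computation.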
Citation
@article{arxiv.2606.28251,
title = {Push Puppet Networks: Structured Bayesian Pruning Algorithm for Language Model Compression},
author = {Robert Kubinec},
journal = {arXiv preprint arXiv:2606.28251},
year = {2026}
}