Related papers: Segmentwise Pruning in Audio-Language Models

Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance

Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing…

Sound · Computer Science 2025-10-27 Taehan Lee, Hyukjun Lee

SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval

We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models' ability to extract critical details from approximately…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-01 Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen

TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models

Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with…

Computation and Language · Computer Science 2025-05-30 Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in…

Sound · Computer Science 2026-04-28 Peize He, Yaodi Luo, Xiaoqian Liu, Xuyang Liu, Jiahang Deng, Yaosong Du, Bangyu Li, Xiyan Gui, Yuxuan Chen, Linfeng Zhang

TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model.…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin

Neural Language Model Pruning for Automatic Speech Recognition

We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their…

Machine Learning · Computer Science 2023-10-06 Leonardo Emili, Thiago Fraga-Silva, Ernest Pusateri, Markus Nußbaum-Thom, Youssef Oualil

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

Tango: Taming Visual Signals for Efficient Video Large Language Models

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation…

Computation and Language · Computer Science 2023-03-01 Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

PruneVid: Visual Token Pruning for Efficient Video Large Language Models

In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Xiaohu Huang, Hao Zhou, Kai Han

LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models

Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting…

Computation and Language · Computer Science 2025-03-11 Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro

StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu, Keda Tao, Yixuan Pei, Huan Wang, Fan Dang, Jiangchuan Liu, Weiqiang Wang

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu

Convexity-based Pruning of Speech Representation Models

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more.…

Computation and Language · Computer Science 2024-11-25 Teresa Dorszewski, Lenka Tětková, Lars Kai Hansen

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Mark Endo, Xiaohan Wang, Serena Yeung-Levy

Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Yogesh Kumar

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Landi He, Mingde Yao, Shawn Young, Lijian Xu