Related papers: Segmentwise Pruning in Audio-Language Models

200 papers

Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing…

Sound · Computer Science 2025-10-27 Taehan Lee, Hyukjun Lee

We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models' ability to extract critical details from approximately…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-01 Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen

Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with…

Computation and Language · Computer Science 2025-05-30 Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang

Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in…

Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model.…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Hao Zhang, Mengsi Lyu, Bo Huang, Yulong Ao, Yonghua Lin

We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their…

Machine Learning · Computer Science 2023-10-06 Leonardo Emili, Thiago Fraga-Silva, Ernest Pusateri, Markus Nußbaum-Thom, Youssef Oualil

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang, Chaoyou Fu, Enhong Chen

Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow. Model compression techniques such as pruning aim to reduce the model size and computation…

Computation and Language · Computer Science 2023-03-01 Yifan Peng, Kwangyoun Kim, Felix Wu, Prashant Sridhar, Shinji Watanabe

Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong

In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Xiaohu Huang, Hao Zhou, Kai Han

Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting…

Computation and Language · Computer Science 2025-03-11 Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro

Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu, Keda Tao, Yixuan Pei, Huan Wang, Fan Dang, Jiangchuan Liu, Weiqiang Wang

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more.…

Computation and Language · Computer Science 2024-11-25 Teresa Dorszewski, Lenka Tětková, Lars Kai Hansen

Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Mark Endo, Xiaohan Wang, Serena Yeung-Levy

Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Yogesh Kumar

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Landi He, Mingde Yao, Shawn Young, Lijian Xu