Related papers: ShaRP: SHAllow-LayeR Pruning for Efficient Video L…

SHRP: Specialized Head Routing and Pruning for Efficient Encoder Compression

Transformer encoders are widely deployed in large-scale web services for natural language understanding tasks such as text classification, semantic retrieval, and content ranking. However, their high inference latency and memory consumption…

Machine Learning · Computer Science 2025-12-25 Zeli Su, Ziyin Zhang, Wenzheng Zhang, Zhou Liu, Guixian Xu, Wentao Zhang

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang

Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu

ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Surendra Pathak, Bo Han

FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference

In this work, we present FastAV, the first token pruning framework tailored for audio-visual large language models (AV-LLMs). While token pruning has been actively explored in standard large language models (LLMs) and vision-language models…

Machine Learning · Computer Science 2026-01-21 Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual…

Computer Vision and Pattern Recognition · Computer Science 2025-05-13 Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods…

Computer Vision and Pattern Recognition · Computer Science 2026-05-18 Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning

Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang

IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Dong-Jae Lee, Sunghyun Baek, Junmo Kim

Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse…

Computer Vision and Pattern Recognition · Computer Science 2026-02-16 Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

FoPru: Focal Pruning for Efficient Large Vision-Language Models

Large Vision-Language Models (LVLMs) represent a significant advancement toward achieving superior multimodal capabilities by enabling powerful Large Language Models (LLMs) to understand visual input. Typically, LVLMs utilize visual…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Lei Jiang, Weizhe Huang, Tongxuan Liu, Yuting Zeng, Jing Li, Lechao Cheng, Xiaohua Xu

Attention Debiasing for Token Pruning in Vision Language Models

Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng

A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification

The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, Jiangmiao Pang

LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with…

Computer Vision and Pattern Recognition · Computer Science 2025-04-16 Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

Despite achieving remarkable performance on various vision-language tasks, Transformer-based Vision-Language Models (VLMs) suffer from redundancy in inputs and parameters, significantly hampering their efficiency in real-world applications.…

Computation and Language · Computer Science 2024-02-27 Zekun Wang, Jingchang Chen, Wangchunshu Zhou, Haichao Zhu, Jiafeng Liang, Liping Shan, Ming Liu, Dongliang Xu, Qing Yang, Bing Qin

Cross-Self KV Cache Pruning for Efficient Vision-Language Inference

KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from…

Computer Vision and Pattern Recognition · Computer Science 2024-12-09 Xiaohuan Pei, Tao Huang, Chang Xu