Related papers: VLTP: Vision-Language Guided Token Pruning for Tas…

EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang

Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan

Revisiting Token Pruning for Object Detection and Instance Segmentation

Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. However, this large number…

Computer Vision and Pattern Recognition · Computer Science 2023-12-14 Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza

Object-Centric Vision Token Pruning for Vision Language Models

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consuming too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with…

Computer Vision and Pattern Recognition · Computer Science 2025-04-16 Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video…

Machine Learning · Computer Science 2025-04-25 Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Vision-Language Transformers (VLTs) have shown great success recently, but are meanwhile accompanied by heavy computation costs, where a major reason can be attributed to the large number of visual and language tokens. Existing token…

Computer Vision and Pattern Recognition · Computer Science 2024-03-06 Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

LVPruning: An Effective yet Simple Language-Guided Vision Token Pruning Approach for Multi-modal Large Language Models

Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting…

Computation and Language · Computer Science 2025-03-11 Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro

Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation

Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs…

Computer Vision and Pattern Recognition · Computer Science 2023-09-29 Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma, Xin Miao

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their…

Computer Vision and Pattern Recognition · Computer Science 2024-02-06 Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen

ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning

Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang

Token Cropr: Faster ViTs for Quite a Few Tasks

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model…

Computer Vision and Pattern Recognition · Computer Science 2022-09-22 Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang

DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by…

Computer Vision and Pattern Recognition · Computer Science 2026-01-23 Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images…

Computer Vision and Pattern Recognition · Computer Science 2026-02-17 Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing

Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Yogesh Kumar