Related papers: VLTP: Vision-Language Guided Token Pruning for Tas…

200 papers

Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-19 · Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-28 · Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan

Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. However, this large number…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-14 · Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-28 · Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-16 · Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani

Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video…

Machine Learning · Computer Science · 2025-04-25 · Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen

Vision-Language Transformers (VLTs) have shown great success recently, but are meanwhile accompanied by heavy computation costs, where a major reason can be attributed to the large number of visual and language tokens. Existing token…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-06 · Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-24 · Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting…

Computation and Language · Computer Science · 2025-03-11 · Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro

Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-29 · Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-18 · Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-04 · Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma, Xin Miao

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen

Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-27 · Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-23 · Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-16 · Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-17 · Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-26 · Yogesh Kumar