Related papers: Video Patch Pruning: Efficient Video Instance Segm…

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

Vision transformer (ViT) has achieved competitive accuracy on a variety of computer vision applications, but its computational cost impedes the deployment on resource-limited mobile devices. We explore the sparsity in ViT and observe that…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-10 · Zhuoran Song, Yihong Xu, Zhezhi He, Li Jiang, Naifeng Jing, Xiaoyao Liang

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Hanning Chen, Yang Ni, Wenjun Huang, Yezi Liu, SungHeon Jeong, Fei Wen, Nathaniel Bastian, Hugo Latapie, Mohsen Imani

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen

Revisiting Token Pruning for Object Detection and Instance Segmentation

Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. However, this large number…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-14 · Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza

EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-19 · Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Shao Tang, Sayan Ghosh, Xuanzhao Dong, Rajat Koner, Yalin Wang

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang

Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-17 · Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

A Unified Pruning Framework for Vision Transformers

Recently, vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks. Yet the high computational costs and training data requirements of ViTs limit their application in…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-01 · Hao Yu, Jianxin Wu

Unified Visual Transformer Compression

Vision transformers (ViTs) have gained popularity recently. Even without customized image operators such as convolutions, ViTs can yield competitive performance when properly trained on massive data. However, the computational overhead of…

Machine Learning · Computer Science · 2022-03-17 · Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, Zhangyang Wang

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-17 · Mingqian Ji, Shanshan Zhang, Jian Yang

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-24 · Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang

Patch Slimming for Efficient Vision Transformers

This paper studies the efficiency problem for visual transformers by excavating redundant calculation in given networks. The recent transformer architecture has demonstrated its effectiveness for achieving excellent performance on a series…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-05 · Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, Dacheng Tao

Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-28 · Yuanbing Ouyang, Yizhuo Liang, Qingpeng Li, Xinfei Guo, Yiming Luo, Di Wu, Hao Wang, Yushan Pan

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Attention is sparse in vision transformers. We observe the final prediction in vision transformers is only based on a subset of most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-27 · Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-21 · Michal Szczepanski, Martyna Poreba, Karim Haroun

StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding

Online video understanding is essential for applications like public surveillance and AI glasses. However, applying Multimodal Large Language Models (MLLMs) to this domain is challenging due to the large number of video frames, resulting in…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-16 · Xinqi Jin, Hanxun Yu, Bohan Yu, Kebin Liu, Jian Liu, Keda Tao, Yixuan Pei, Huan Wang, Fan Dang, Jiangchuan Liu, Weiqiang Wang

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Chaoya Jiang, Haiyang Xu, Chenliang Li, Ming Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang

Video Mask Transfiner for High-Quality Video Instance Segmentation

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-29 · Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-14 · Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songfang Huang, Fei Huang

Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado