Related papers: Token Pruning in Audio Transformers: Optimizing Pe…

Revisiting Token Pruning for Object Detection and Instance Segmentation

Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. However, this large number…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-14 · Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Cannici, Davide Scaramuzza

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-10 · Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund

Token Cropr: Faster ViTs for Quite a Few Tasks

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

Efficient Token Compression for Vision Transformer with Spatial Information Preserved

Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua

Multi-Scale And Token Mergence: Make Your ViT More Efficient

Since its inception, Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain. Nonetheless, the multi-head self-attention (MHSA) mechanism in ViT is computationally expensive due to its calculation of…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-25 · Zhe Bian, Zhe Wang, Wenqiang Han, Kangping Wang

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-19 · Hyunchan Moon, Cheonjun Park, Steven L. Waslander

Exploring Token Pruning in Vision State Space Models

State Space Models (SSMs) have the advantage of keeping linear computational complexity compared to attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-30 · Zheng Zhan, Zhenglun Kong, Yifan Gong, Yushu Wu, Zichong Meng, Hangyu Zheng, Xuan Shen, Stratis Ioannidis, Wei Niu, Pu Zhao, Yanzhi Wang

Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI

Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Due to the quadratic computational complexity with respect to the token sequence length, these…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-15 · Phat Nguyen, Ngai-Man Cheung

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-24 · Siyuan Wei, Tianzhu Ye, Shen Zhang, Yao Tang, Jiajun Liang

Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers

Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT's $O(n^2)$ complexity by pruning unimportant…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-17 · Yi-Kuan Hsieh, Jun-Wei Hsieh, Xin Li, Yu-Ming Chang, Yu-Chee Tseng

Segmentwise Pruning in Audio-Language Models

Recent audio-language models have shown impressive performance across a wide range of audio tasks and are increasingly capable of handling long audio inputs. However, the computing costs in these models heavily depend on sequence length,…

Sound · Computer Science · 2025-11-19 · Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Vision transformer has emerged as a new paradigm in computer vision, showing excellent performance while accompanied by expensive computational cost. Image token pruning is one of the main approaches for ViT compression, due to the facts…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-07 · Xiangcheng Liu, Tianyi Wu, Guodong Guo

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Xinjian Wu, Fanhu Zeng, Xiudong Wang, Xinghao Chen

Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-04-15 · Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Hanning Chen, Yang Ni, Wenjun Huang, Yezi Liu, SungHeon Jeong, Fei Wen, Nathaniel Bastian, Hugo Latapie, Mohsen Imani

Patch Ranking: Efficient CLIP by Learning to Rank Local Patches

Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers to accelerate inference by pruning redundant tokens. A key to…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-27 · Fengyuan Shi, Limin Wang

HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer

Vision Transformers (ViTs) deliver state-of-the-art accuracy but their quadratic attention cost and redundant computations severely hinder deployment on latency and resource-constrained platforms. Existing pruning approaches treat either…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-24 · Mohammad Helal Uddin, Liam Seymour, Sabur Baidya