Related papers: TinyDrop: Tiny Model Guided Token Dropping for Vis…

Self-slimmed Vision Transformer

Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers bring a huge computation burden, because of the exhausting…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-13 · Zhuofan Zong, Kunchang Li, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu

Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms their inference efficiency. To solve this problem, some recent works were…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-19 · Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang

Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-20 · Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-03 · Rui Xu, Yunke Wang, Yong Luo, Bo Du

TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving

Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources that pose a challenge for their deployment in resource-constrained vehicles. To address this…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-22 · Hossein Hassani, Soodeh Nikan, Abdallah Shami

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-28 · Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin

TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Vision transformer (ViT) has recently drawn great attention in computer vision due to its remarkable model capability. However, most prevailing ViT models suffer from a huge number of parameters, restricting their applicability on devices…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-22 · Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan

Powerful Design of Small Vision Transformer on CIFAR10

Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of…

Machine Learning · Computer Science · 2025-01-14 · Gent Wu

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-26 · Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-10 · Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund

ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-23 · Haoyue Zhang, Jie Zhang, Song Guo

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks,…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-14 · Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in compressing vision transformers dynamically at the token level. Current methods mainly pay attention…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-09 · Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang

Training-free Token Reduction for Vision Mamba

Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. While token reduction, an effective compression technique…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-21 · Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Hanning Chen, Yang Ni, Wenjun Huang, Yezi Liu, SungHeon Jeong, Fei Wen, Nathaniel Bastian, Hugo Latapie, Mohsen Imani

Token Cropr: Faster ViTs for Quite a Few Tasks

The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Benjamin Bergner, Christoph Lippert, Aravindh Mahendran

PatchDropout: Economizing Vision Transformers Using Patch Dropout

Vision transformers have demonstrated the potential to outperform CNNs in a variety of vision tasks. But the computational and memory requirements of these models prohibit their use in many applications, especially those that depend on…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-06 · Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, Kevin Smith

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-01 · Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

AdapterDrop: On the Efficiency of Adapters in Transformers

Massively pre-trained transformer models are computationally expensive to fine-tune, slow for inference, and have large storage requirements. Recent approaches tackle these shortcomings by training smaller models, dynamically reducing the…

Machine Learning · Computer Science · 2021-10-07 · Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, Iryna Gurevych

All Tokens Matter: Token Labeling for Training Better Vision Transformers

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-10 · Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, Jiashi Feng