Related papers: Adaptive Token Sampling For Efficient Vision Trans…

SaiT: Sparse Vision Transformers through Adaptive Token Pruning

While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performances. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling…

Computer Vision and Pattern Recognition · Computer Science 2022-10-13 Ling Li, David Thorsley, Joseph Hassoun

Speed-up of Vision Transformer Models by Attention-aware Token Filtering

Vision Transformer (ViT) models have made breakthroughs in image embedding extraction, which provide state-of-the-art performance in tasks such as zero-shot image classification. However, the models suffer from a high computational burden.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-03 Takahiro Naruko, Hiroaki Akutsu

Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Jaeyeon Lee, Dong-Wan Choi

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in…

Machine Learning · Computer Science 2023-07-06 Qiqi Zhou, Yichen Zhu

Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this…

Computer Vision and Pattern Recognition · Computer Science 2024-08-19 Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane

SAT: Selective Aggregation Transformer for Image Super-Resolution

Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to…

Computer Vision and Pattern Recognition · Computer Science 2026-04-13 Dinh Phu Tran, Thao Do, Saad Wazir, Seongah Kim, Seon Kwon Kim, Daeyoung Kim

AdaViT: Adaptive Tokens for Efficient Vision Transformer

We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens in vision transformers that are…

Computer Vision and Pattern Recognition · Computer Science 2022-10-10 Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov

Enhancing Transformer-Based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies

Vision Transformers (ViTs) have demonstrated superior performance across a wide range of computer vision tasks. However, structured noise artifacts in their feature maps hinder downstream applications such as segmentation and depth…

Computer Vision and Pattern Recognition · Computer Science 2025-09-25 Sumit Mamtani

HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition

Action recognition in videos poses a challenge due to its high computational cost, especially for Joint Space-Time video transformers (Joint VT). Despite their effectiveness, the excessive number of tokens in such architectures…

Computer Vision and Pattern Recognition · Computer Science 2024-01-11 Qian Wu, Ruoxuan Cui, Yuke Li, Haoqi Zhu

SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers

Over the past few years, vision transformers (ViTs) have consistently demonstrated remarkable performance across various visual recognition tasks. However, attempts to enhance their robustness have yielded limited success, mainly focusing…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Nick Nikzad, Yi Liao, Yongsheng Gao, Jun Zhou

Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

AT-SNN: Adaptive Tokens for Vision Transformer on Spiking Neural Network

In the training and inference of spiking neural networks (SNNs), direct training and lightweight computation methods have been orthogonally developed, aimed at reducing power consumption. However, only a limited number of approaches have…

Artificial Intelligence · Computer Science 2024-08-23 Donghwa Kang, Youngmoon Lee, Eun-Kyu Lee, Brent Kang, Jinkyu Lee, Hyeongboo Baek

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in dynamically compressing vision transformers at the token level. Current methods mainly pay attention…

Computer Vision and Pattern Recognition · Computer Science 2025-06-09 Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications by updating only a small subset of parameters. While current PEFT methods have…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Ting Liu, Xuyang Liu, Liangtao Shi, Zunnan Xu, Yue Hu, Siteng Huang, Yi Xin, Bineng Zhong, Donglin Wang

Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection

Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, Stan Z. Li

Accelerating Vision Transformers with Adaptive Patch Sizes

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses…

Computer Vision and Pattern Recognition · Computer Science 2026-04-24 Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani

Vision Transformer with Super Token Sampling

Vision transformer has achieved impressive performance for many vision tasks. However, it may suffer from high redundancy in capturing local features for shallow layers. Local self-attention or early-stage convolutions are thus utilized,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-26 Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, Tieniu Tan

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Zichuan Lin, Yicheng Liu, Yang Yang, Lvfang Tao, Deheng Ye