Related papers: Accelerating Vision Transformers with Adaptive Pat…

Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Massoud Dehghan, Ramona Woitek, Amirreza Mahbod

VariViT: A Vision Transformer for Variable Image Sizes

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a…

Computer Vision and Pattern Recognition · Computer Science 2026-02-17 Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler

A-ViT: Adaptive Tokens for Efficient Vision Transformer

We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens in vision transformers that are…

Computer Vision and Pattern Recognition · Computer Science 2022-10-10 Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo Molchanov

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT)…

Computer Vision and Pattern Recognition · Computer Science 2023-07-13 Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Built on top of self-attention mechanisms, vision transformers have demonstrated remarkable performance on a variety of vision tasks recently. While achieving excellent performance, they still require relatively intensive computational cost…

Computer Vision and Pattern Recognition · Computer Science 2021-12-01 Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, Ser-Nam Lim

Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to…

Machine Learning · Computer Science 2023-02-23 Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

FlexiViT: One Model for All Patch Sizes

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the…

Computer Vision and Pattern Recognition · Computer Science 2023-03-27 Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

Intriguing Properties of Vision Transformers

Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode…

Computer Vision and Pattern Recognition · Computer Science 2021-11-29 Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA.…

Computer Vision and Pattern Recognition · Computer Science 2022-04-15 Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies

In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Shaibal Saha, Lanyu Xu

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

Vision Transformer for Small-Size Datasets

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a…

Computer Vision and Pattern Recognition · Computer Science 2021-12-28 Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song

Accelerating Vision Transformer Training via a Patch Sampling Schedule

We introduce the notion of a Patch Sampling Schedule (PSS) that varies the number of Vision Transformer (ViT) patches used per batch during training. Since not all patches are equally important for most vision objectives (e.g.,…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Bradley McDanel, Chi Phuong Huynh

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in…

Machine Learning · Computer Science 2023-07-06 Qiqi Zhou, Yichen Zhu

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Vision Transformers (ViT) have made many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. Therefore, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji

Efficient Vision Transformer for Human Pose Estimation via Patch Selection

While Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance. However, the quadratic…

Computer Vision and Pattern Recognition · Computer Science 2023-11-23 Kaleab A. Kinfu, Rene Vidal

Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning

Recently, foundation models based on Vision Transformers (ViTs) have become widely available. However, their fine-tuning process is highly resource-intensive, and it hinders their adoption in several edge or low-energy applications. To this…

Computer Vision and Pattern Recognition · Computer Science 2024-08-19 Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer splits each image into a sequence of tokens with fixed length and processes the tokens in the same way as words in natural language processing. More tokens normally lead to better performance but considerably…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Yichen Zhu, Yuqin Zhu, Jie Du, Yi Wang, Zhicai Ou, Feifei Feng, Jian Tang

Compress image to patches for Vision Transformer

The Vision Transformer (ViT) has made significant strides in the field of computer vision. However, as the depth of the model and the resolution of the input images increase, the computational cost associated with training and running ViT…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Xinfeng Zhao, Yaoru Sun