Related papers: Vision Transformers with Patch Diversification

Learning Diverse Features in Vision Transformers for Improved Generalization

Deep learning models often rely only on a small set of features even when there is a rich set of predictive signals in the training data. This makes models brittle and sensitive to distribution shifts. In this work, we first examine vision…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-01 · Armand Mihai Nicolicioiu, Andrei Liviu Nicolicioiu, Bogdan Alexe, Damien Teney

Locality-Attending Vision Transformer

Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-06 · Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri, Christian Desrosiers, Ismail Ben Ayed, Jose Dolz

Boosting vision transformers for image retrieval

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-24 · Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis

Three things everyone should know about Vision Transformers

After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-21 · Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou

An Empirical Study of Training Self-Supervised Vision Transformers

This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-17 · Xinlei Chen, Saining Xie, Kaiming He

Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation

We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to…

Machine Learning · Computer Science · 2023-02-23 · Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Vision transformers (ViTs) have gained increasing popularity as they are commonly believed to own higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such…

Machine Learning · Computer Science · 2022-03-15 · Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-20 · Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

Vision Transformers Need Better Token Interaction

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-25 · Linxiang Su

Vision Transformers are Robust Learners

Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-07 · Sayak Paul, Pin-Yu Chen

PatchRot: A Self-Supervised Technique for Training Vision Transformers

Vision transformers require a huge amount of labeled data to outperform convolutional neural networks. However, labeling a huge dataset is a very expensive process. Self-supervised learning techniques alleviate this problem by learning…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-31 · Sachin Chhabra, Prabal Bijoy Dutta, Hemanth Venkateswara, Baoxin Li

Vision Transformer Finetuning Benefits from Non-Smooth Components

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper,…

Machine Learning · Computer Science · 2026-02-10 · Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Patch Is Not All You Need

Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-22 · Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang Shan

Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers

Multi-head self-attention is a distinctive feature extraction mechanism of vision transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-28 · Yuki Igaue, Hiroaki Aizawa

Exploring and Improving Mobile Level Vision Transformers

We study the vision transformer structure in the mobile level in this paper, and find a dramatic performance drop. We analyze the reason behind this phenomenon, and propose a novel irregular patch embedding module and adaptive patch fusion…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-31 · Pengguang Chen, Yixin Chen, Shu Liu, Mingchang Yang, Jiaya Jia

Vicinity Vision Transformer

Vision transformers have shown great success on numerous computer vision tasks. However, its central component, softmax attention, prohibits vision transformers from scaling up to high-resolution images, due to both the computational…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-21 · Weixuan Sun, Zhen Qin, Hui Deng, Jianyuan Wang, Yi Zhang, Kaihao Zhang, Nick Barnes, Stan Birchfield, Lingpeng Kong, Yiran Zhong

Demystify Transformers & Convolutions in Modern Image Deep Networks

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-23 · Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai

Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches

The latest generation of transformer-based vision models has proven to be superior to Convolutional Neural Network (CNN)-based models across several vision tasks, largely attributed to their remarkable prowess in relation modeling.…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-29 · Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Built on top of self-attention mechanisms, vision transformers have demonstrated remarkable performance on a variety of vision tasks recently. While achieving excellent performance, they still require relatively intensive computational cost…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-01 · Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, Ser-Nam Lim

Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization

Transformers have demonstrated great power in the recent development of large foundational models. In particular, the Vision Transformer (ViT) has brought revolutionary changes to the field of vision, achieving significant accomplishments…

Machine Learning · Computer Science · 2024-11-25 · Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, Liqiang Nie