Related papers: Vision Transformers provably learn spatial structu…

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attentions are then applied to the…

Machine Learning · Computer Science 2023-03-27 Yiran Li, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yan Zheng, Wei Zhang, Kwan-Liu Ma

What do Vision Transformers Learn? A Visual Exploration

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-12-14 Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein

Intriguing Properties of Vision Transformers

Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode…

Computer Vision and Pattern Recognition · Computer Science 2021-11-29 Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations…

Machine Learning · Computer Science 2024-11-15 Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

Learning Priors of Human Motion With Vision Transformers

A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop, is very important for different applications, such as mobility studies in urban areas or robot navigation tasks within…

Computer Vision and Pattern Recognition · Computer Science 2025-01-31 Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli, Daniele Fontanelli

Structured Initialization for Attention in Vision Transformers

The training of vision transformer (ViT) networks on small-scale datasets poses a significant challenge. By contrast, convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on such problems.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Jianqiao Zheng, Xueqian Li, Simon Lucey

Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields

Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks. Because partitioning patches eliminates the image structure, to reflect the order of patches, ViTs…

Computer Vision and Pattern Recognition · Computer Science 2023-05-09 Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

ViT-P: Rethinking Data-efficient Vision Transformers from Locality

Recent advances in Transformers have brought new trust to computer vision tasks. However, on small datasets, Transformers are hard to train and have lower performance than convolutional neural networks. We make vision transformers as…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Bin Chen, Ran Wang, Di Ming, Xin Feng

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through…

Computer Vision and Pattern Recognition · Computer Science 2023-04-07 Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

Learning Spatial Decay for Vision Transformers

Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Yuxin Mao, Zhen Qin, Jinxing Zhou, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has a…

Computer Vision and Pattern Recognition · Computer Science 2024-07-26 Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin

Training Vision Transformers with Only 2040 Images

Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition. They achieve competitive results with CNNs but the lack of the typical convolutional inductive bias makes them more…

Computer Vision and Pattern Recognition · Computer Science 2022-01-27 Yun-Hao Cao, Hao Yu, Jianxin Wu

Surface Analysis with Vision Transformers

The extension of convolutional neural networks (CNNs) to non-Euclidean geometries has led to multiple frameworks for studying manifolds. Many of those methods have shown design limitations resulting in poor modelling of long-range…

Computer Vision and Pattern Recognition · Computer Science 2022-06-01 Simon Dahan, Logan Z. J. Williams, Abdulah Fawaz, Daniel Rueckert, Emma C. Robinson

Interpretable Vision Transformers in Image Classification via SVDA

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

Rethinking Spatial Dimensions of Vision Transformers

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Vision Transformer (ViT) is one of the most widely used models in the computer vision field with its great performance on various tasks. In order to fully utilize the ViT-based architecture in various applications, proper visualization…

Computer Vision and Pattern Recognition · Computer Science 2024-02-08 Saebom Leem, Hyunseok Seo

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya, Igal Bilik, Inna Stainvas