Related papers: How Do Vision Transformers Work?

Optimizing Vision Transformers for Medical Image Segmentation

For medical image semantic segmentation (MISS), Vision Transformers have emerged as strong alternatives to convolutional neural networks thanks to their inherent ability to capture long-range correlations. However, existing research uses…

Computer Vision and Pattern Recognition · Computer Science 2023-06-06 Qianying Liu , Chaitanya Kaul , Jun Wang , Christos Anagnostopoulos , Roderick Murray-Smith , Fani Deligianni

Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted…

Computer Vision and Pattern Recognition · Computer Science 2021-09-03 Yi-Lun Liao , Sertac Karaman , Vivienne Sze

Scratching Visual Transformer's Back with Uniform Attention

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA). The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Nam Hyeon-Woo , Kim Yu-Ji , Byeongho Heo , Dongyoon Han , Seong Joon Oh , Tae-Hyun Oh

Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work

Vision Transformers (ViTs) are becoming more popular and dominating technique for various vision tasks, compare to Convolutional Neural Networks (CNNs). As a demanding technique in computer vision, ViTs have been successfully solved various…

Computer Vision and Pattern Recognition · Computer Science 2023-10-18 Khawar Islam

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu , Thomas Unterthiner , Simon Kornblith , Chiyuan Zhang , Alexey Dosovitskiy

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya , Igal Bilik , Inna Stainvas

Surface Analysis with Vision Transformers

The extension of convolutional neural networks (CNNs) to non-Euclidean geometries has led to multiple frameworks for studying manifolds. Many of those methods have shown design limitations resulting in poor modelling of long-range…

Computer Vision and Pattern Recognition · Computer Science 2022-06-01 Simon Dahan , Logan Z. J. Williams , Abdulah Fawaz , Daniel Rueckert , Emma C. Robinson

Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application

Self-attention mechanisms, especially multi-head self-attention (MSA), have achieved great success in many fields such as computer vision and natural language processing. However, many existing vision transformer (ViT) works simply inherent…

Computer Vision and Pattern Recognition · Computer Science 2022-11-17 Leijie Wu , Song Guo , Yaohong Ding , Junxiao Wang , Wenchao Xu , Richard Yida Xu , Jie Zhang

ConvNets vs. Transformers: Whose Visual Representations are More Transferable?

Vision transformers have attracted much attention from computer vision researchers as they are not restricted to the spatial inductive bias of ConvNets. However, although Transformer-based backbones have achieved much progress on ImageNet…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Hong-Yu Zhou , Chixiang Lu , Sibei Yang , Yizhou Yu

Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention

Convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable success in various vision tasks. However, many architectures do not consider interactions between feature maps from different stages and scales,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Liang Shang , Yanli Liu , Zhengyang Lou , Shuxue Quan , Nagesh Adluru , Bochen Guan , William A. Sethares

What do Vision Transformers Learn? A Visual Exploration

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-12-14 Amin Ghiasi , Hamid Kazemi , Eitan Borgnia , Steven Reich , Manli Shu , Micah Goldblum , Andrew Gordon Wilson , Tom Goldstein

On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery

Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on…

Computer Vision and Pattern Recognition · Computer Science 2024-09-19 BW Sheffield , Jeffrey Ellen , Ben Whitmore

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attentions are then applied to the…

Machine Learning · Computer Science 2023-03-27 Yiran Li , Junpeng Wang , Xin Dai , Liang Wang , Chin-Chia Michael Yeh , Yan Zheng , Wei Zhang , Kwan-Liu Ma

Masked autoencoders are effective solution to transformer data-hungry

Vision Transformers (ViTs) outperforms convolutional neural networks (CNNs) in several vision tasks with its global modeling capabilities. However, ViT lacks the inductive bias inherent to convolution making it require a large amount of…

Computer Vision and Pattern Recognition · Computer Science 2023-01-11 Jiawei Mao , Honggu Zhou , Xuesong Yin , Yuanqi Chang , Binling Nie , Rui Xu

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved comparable or superior performance than Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Samy Jelassi , Michael E. Sander , Yuanzhi Li

Multi-Attribute Vision Transformers are Efficient and Robust Learners

Since their inception, Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) across a wide spectrum of tasks. ViTs exhibit notable characteristics, including global attention, resilience…

Computer Vision and Pattern Recognition · Computer Science 2024-07-22 Hanan Gani , Nada Saadi , Noor Hussein , Karthik Nandakumar

Vision Transformer for Small-Size Datasets

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a…

Computer Vision and Pattern Recognition · Computer Science 2021-12-28 Seung Hoon Lee , Seunghyun Lee , Byung Cheol Song

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through…

Computer Vision and Pattern Recognition · Computer Science 2023-04-07 Matthew Walmer , Saksham Suri , Kamal Gupta , Abhinav Shrivastava

How to Train Vision Transformer on Small-scale Datasets?

Vision Transformer (ViT), a radically different architecture than convolutional neural networks offers multiple advantages including design simplicity, robustness and state-of-the-art performance on many vision tasks. However, in contrast…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Hanan Gani , Muzammal Naseer , Mohammad Yaqub

Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dimitrios N. Vlachogiannis , Dimitrios A. Koutsomitropoulos