Related papers: Can Vision Transformers Perform Convolution?

Vision Conformer: Incorporating Convolutions into Vision Transformer Layers

Transformers are popular neural network models that use layers of self-attention and fully-connected nodes with embedded tokens. Vision Transformers (ViT) adapt transformers for image recognition tasks. In order to do this, the images are…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Brian Kenji Iwana , Akihiro Kusuda

Transformed CNNs: recasting pre-trained convolutional layers with self-attention

Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a…

Machine Learning · Computer Science 2021-06-11 Stéphane d'Ascoli , Levent Sagun , Giulio Biroli , Ari Morcos

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu , Thomas Unterthiner , Simon Kornblith , Chiyuan Zhang , Alexey Dosovitskiy

On the Relationship between Self-Attention and Convolutional Layers

Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al.…

Machine Learning · Computer Science 2020-01-13 Jean-Baptiste Cordonnier , Andreas Loukas , Martin Jaggi

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and…

Computer Vision and Pattern Recognition · Computer Science 2022-01-26 Pranav Jeevan , Amit Sethi

Surface Analysis with Vision Transformers

The extension of convolutional neural networks (CNNs) to non-Euclidean geometries has led to multiple frameworks for studying manifolds. Many of those methods have shown design limitations resulting in poor modelling of long-range…

Computer Vision and Pattern Recognition · Computer Science 2022-06-01 Simon Dahan , Logan Z. J. Williams , Abdulah Fawaz , Daniel Rueckert , Emma C. Robinson

ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. However, ViTs face challenges such as high computational costs due to the quadratic scaling of self-attention and…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Zhoujie Qian

Intriguing Properties of Vision Transformers

Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode…

Computer Vision and Pattern Recognition · Computer Science 2021-11-29 Muzammal Naseer , Kanchana Ranasinghe , Salman Khan , Munawar Hayat , Fahad Shahbaz Khan , Ming-Hsuan Yang

A Close Look at Spatial Modeling: From Attention to Convolution

Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two…

Computer Vision and Pattern Recognition · Computer Science 2022-12-27 Xu Ma , Huan Wang , Can Qin , Kunpeng Li , Xingchen Zhao , Jie Fu , Yun Fu

Vision Transformer: Vit and its Derivatives

Transformer, an attention-based encoder-decoder architecture, has not only revolutionized the field of natural language processing (NLP), but has also done some pioneering work in the field of computer vision (CV). Compared to convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-05-25 Zujun Fu

MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets

Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability in modelling long-range dependencies. However, such success is largely fueled by training on massive samples. In real…

Computer Vision and Pattern Recognition · Computer Science 2025-01-15 Bowei Zhang , Yi Zhang

CvT: Introducing Convolutions to Vision Transformers

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is…

Computer Vision and Pattern Recognition · Computer Science 2021-03-30 Haiping Wu , Bin Xiao , Noel Codella , Mengchen Liu , Xiyang Dai , Lu Yuan , Lei Zhang

Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work

Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs). As a demanding technique in computer vision, ViTs have successfully solved various…

Computer Vision and Pattern Recognition · Computer Science 2023-10-18 Khawar Islam

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Stéphane d'Ascoli , Hugo Touvron , Matthew Leavitt , Ari Morcos , Giulio Biroli , Levent Sagun

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou , Bingyi Kang , Xiaojie Jin , Linjie Yang , Xiaochen Lian , Zihang Jiang , Qibin Hou , Jiashi Feng

Are Convolutional Neural Networks or Transformers more like human vision?

Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function…

Computer Vision and Pattern Recognition · Computer Science 2021-07-02 Shikhar Tuli , Ishita Dasgupta , Erin Grant , Thomas L. Griffiths

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Zizheng Pan , Bohan Zhuang , Haoyu He , Jing Liu , Jianfei Cai

What do Vision Transformers Learn? A Visual Exploration

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-12-14 Amin Ghiasi , Hamid Kazemi , Eitan Borgnia , Steven Reich , Manli Shu , Micah Goldblum , Andrew Gordon Wilson , Tom Goldstein

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved performance comparable or superior to that of Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Samy Jelassi , Michael E. Sander , Yuanzhi Li

A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism

Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability. However, its large model size and weak local feature modeling ability hinder its application in real…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Yi Zhang , Lingxiao Wei , Bowei Zhang , Ziwei Liu , Kai Yi , Shu Hu