Related papers: Toward Transformer-Based Object Detection

Visual Transformer for Object Detection

Convolutional Neural networks (CNN) have been the first choice of paradigm in many computer vision applications. The convolution operation however has a significant weakness which is it only operates on a local neighborhood of pixels, thus…

Computer Vision and Pattern Recognition · Computer Science 2022-06-14 Michael Yang

Exploring Plain Vision Transformer Backbones for Object Detection

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical…

Computer Vision and Pattern Recognition · Computer Science 2022-06-13 Yanghao Li , Hanzi Mao , Ross Girshick , Kaiming He

Benchmarking Detection Transfer Learning with Vision Transformers

Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial…

Computer Vision and Pattern Recognition · Computer Science 2021-11-23 Yanghao Li , Saining Xie , Xinlei Chen , Piotr Dollar , Kaiming He , Ross Girshick

A Survey on Visual Transformer

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to…

Computer Vision and Pattern Recognition · Computer Science 2023-07-11 Kai Han , Yunhe Wang , Hanting Chen , Xinghao Chen , Jianyuan Guo , Zhenhua Liu , Yehui Tang , An Xiao , Chunjing Xu , Yixing Xu , Zhaohui Yang , Yiman Zhang , Dacheng Tao

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing objects from subcategories is a very challenging task due to the inherently subtle inter-class differences. Most existing works mainly tackle this problem by reusing the…

Computer Vision and Pattern Recognition · Computer Science 2021-12-03 Ju He , Jie-Neng Chen , Shuai Liu , Adam Kortylewski , Cheng Yang , Yutong Bai , Changhu Wang

Vision Transformer with Convolutions Architecture Search

Transformers exhibit great advantages in handling computer vision tasks. They model image classification tasks by utilizing a multi-head attention mechanism to process a series of patches consisting of split images. However, for complex…

Computer Vision and Pattern Recognition · Computer Science 2022-03-22 Haichao Zhang , Kuangrong Hao , Witold Pedrycz , Lei Gao , Xuesong Tang , Bing Wei

An Extendable, Efficient and Effective Transformer-based Object Detector

Transformers have been widely used in numerous vision problems especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the…

Computer Vision and Pattern Recognition · Computer Science 2022-04-19 Hwanjun Song , Deqing Sun , Sanghyuk Chun , Varun Jampani , Dongyoon Han , Byeongho Heo , Wonjae Kim , Ming-Hsuan Yang

VcT: Visual change Transformer for Remote Sensing Image Change Detection

Existing visual change detectors usually adopt CNNs or Transformers for feature representation learning and focus on learning effective representation for the changed regions between images. Although good performance can be obtained by…

Computer Vision and Pattern Recognition · Computer Science 2023-10-18 Bo Jiang , Zitian Wang , Xixi Wang , Ziyan Zhang , Lan Chen , Xiao Wang , Bin Luo

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully…

Computer Vision and Pattern Recognition · Computer Science 2021-11-30 Hwanjun Song , Deqing Sun , Sanghyuk Chun , Varun Jampani , Dongyoon Han , Byeongho Heo , Wonjae Kim , Ming-Hsuan Yang

A Comprehensive Study of Vision Transformers on Dense Prediction Tasks

Convolutional Neural Networks (CNNs), architectures consisting of convolutional layers, have been the standard choice in vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules,…

Computer Vision and Pattern Recognition · Computer Science 2022-01-24 Kishaan Jeeveswaran , Senthilkumar Kathiresan , Arnav Varma , Omar Magdy , Bahram Zonooz , Elahe Arani

A Survey of Visual Transformers

Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Yang Liu , Yao Zhang , Yixin Wang , Feng Hou , Jin Yuan , Jiang Tian , Yang Zhang , Zhongchao Shi , Jianping Fan , Zhiqiang He

Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dimitrios N. Vlachogiannis , Dimitrios A. Koutsomitropoulos

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu , Thomas Unterthiner , Simon Kornblith , Chiyuan Zhang , Alexey Dosovitskiy

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via…

Computer Vision and Pattern Recognition · Computer Science 2022-03-02 Jun Wang , Xiaohan Yu , Yongsheng Gao

A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation

Vision-transformers (ViTs) and large-scale convolution-neural-networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their efficiency as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Alon Kaya , Igal Bilik , Inna Stainvas

ChangeViT: Unleashing Plain Vision Transformers for Change Detection

Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain…

Computer Vision and Pattern Recognition · Computer Science 2024-06-19 Duowang Zhu , Xiaohu Huang , Haiyan Huang , Zhenfeng Shao , Qimin Cheng

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC).These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC…

Computer Vision and Pattern Recognition · Computer Science 2022-03-25 Zi-Chao Zhang , Zhen-Duo Chen , Yongxin Wang , Xin Luo , Xin-Shun Xu

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-06-04 Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , Jakob Uszkoreit , Neil Houlsby

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-12-01 Li Yuan , Yunpeng Chen , Tao Wang , Weihao Yu , Yujun Shi , Zihang Jiang , Francis EH Tay , Jiashi Feng , Shuicheng Yan

FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

Recently, Vision Transformers (ViTs) have achieved unprecedented effectiveness in the general domain of image classification. Nonetheless, these models remain underexplored in the field of deepfake detection, given their lower performance…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Dat Nguyen , Marcella Astrid , Enjie Ghorbel , Djamila Aouada