Related papers: Dynamic Token Normalization Improves Vision Transf…

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-25 Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian

Vision Transformers with Natural Language Semantics

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2023-01-02 Zhiying Lu, Hongtao Xie, Chuanbin Liu, Yongdong Zhang

DaViT: Dual Attention Vision Transformers

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the…

Computer Vision and Pattern Recognition · Computer Science 2022-04-08 Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-12-01 Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

TFS-ViT: Token-Level Feature Stylization for Domain Generalization

Standard deep learning models such as convolutional neural networks (CNNs) lack the ability of generalizing to domains which have not been seen during training. This problem is mainly due to the common but often wrong assumption of such…

Computer Vision and Pattern Recognition · Computer Science 2024-03-19 Mehrdad Noori, Milad Cheraghalikhani, Ali Bahri, Gustavo A. Vargas Hakim, David Osowiechi, Ismail Ben Ayed, Christian Desrosiers

Optimizing Vision Transformers with Data-Free Knowledge Transfer

The groundbreaking performance of transformers in Natural Language Processing (NLP) tasks has led to their replacement of traditional Convolutional Neural Networks (CNNs), owing to the efficiency and accuracy achieved through the…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Gousia Habib, Damandeep Singh, Ishfaq Ahmad Malik, Brejesh Lall

Hands-on Evaluation of Visual Transformers for Object Recognition and Detection

Convolutional Neural Networks (CNNs) for computer vision sometimes struggle with understanding images in a global context, as they mainly focus on local patterns. On the other hand, Vision Transformers (ViTs), inspired by models originally…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dimitrios N. Vlachogiannis, Dimitrios A. Koutsomitropoulos

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim

Morphing Tokens Draw Strong Masked Image Models

Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIMs predict masked tokens token-wise to recover target signals that are tokenized from images or generated by pre-trained models…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Taekyung Kim, Byeongho Heo, Dongyoon Han

Make A Long Image Short: Adaptive Token Length for Vision Transformers

The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in…

Machine Learning · Computer Science 2023-07-06 Qiqi Zhou, Yichen Zhu

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

Video Transformer Network

This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann

MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of…

Computer Vision and Pattern Recognition · Computer Science 2025-01-17 Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang

Local Context Normalization: Revisiting Local Normalization

Normalization layers have been shown to improve convergence in deep neural networks, and even add useful inductive biases. In many vision applications the local spatial context of the features is important, but most common normalization…

Computer Vision and Pattern Recognition · Computer Science 2020-05-12 Anthony Ortiz, Caleb Robinson, Dan Morris, Olac Fuentes, Christopher Kiekintveld, Md Mahmudulla Hassan, Nebojsa Jojic

Vision Conformer: Incorporating Convolutions into Vision Transformer Layers

Transformers are popular neural network models that use layers of self-attention and fully-connected nodes with embedded tokens. Vision Transformers (ViT) adapt transformers for image recognition tasks. In order to do this, the images are…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Brian Kenji Iwana, Akihiro Kusuda

HTR-VT: Handwritten Text Recognition with Vision Transformer

We explore the application of Vision Transformer (ViT) for handwritten text recognition. The limited availability of labeled data in this domain poses challenges for achieving high performance solely relying on ViT. Previous…

Computer Vision and Pattern Recognition · Computer Science 2024-09-16 Yuting Li, Dexiong Chen, Tinglong Tang, Xi Shen