Related papers: Visual Transformers: Token-based Image Representat…

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon

TCFormer: Visual Recognition via Token Clustering Transformer

Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Vision Transformers with Mixed-Resolution Tokenization

Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches. Conversely, Transformers were originally introduced over natural language sequences, where each token represents a subword…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Tomer Ronen, Omer Levy, Avram Golbert

Do Vision Transformers See Like Convolutional Neural Networks?

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This…

Computer Vision and Pattern Recognition · Computer Science 2022-03-07 Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

Vision Transformers with Natural Language Semantics

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro

Patch Is Not All You Need

Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang Shan

Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first…

Computer Vision and Pattern Recognition · Computer Science 2020-12-21 Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-12-01 Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-06-04 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Vision Conformer: Incorporating Convolutions into Vision Transformer Layers

Transformers are popular neural network models that use layers of self-attention and fully-connected nodes with embedded tokens. Vision Transformers (ViT) adapt transformers for image recognition tasks. In order to do this, the images are…

Computer Vision and Pattern Recognition · Computer Science 2023-04-28 Brian Kenji Iwana, Akihiro Kusuda

Making Vision Transformers Efficient from A Token Sparsification View

The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens…

Computer Vision and Pattern Recognition · Computer Science 2021-10-27 Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally…

Computer Vision and Pattern Recognition · Computer Science 2022-04-22 Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, Xiaogang Wang

CMT: Convolutional Neural Networks Meet Vision Transformers

Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image. However, there are still gaps in both performance and computational cost between…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, Chang Xu

Vision Transformers Need Better Token Interaction

Vision Transformers (ViTs) can learn strong image-level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Linxiang Su

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

Boosting vision transformers for image retrieval

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-10-24 Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

Image Recognition with Online Lightweight Vision Transformer: A Survey

The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu