Related papers: Vision Transformer with Progressive Sampling

Vision Transformers with Natural Language Semantics

Tokens or patches within Vision Transformers (ViT) lack essential semantic information, unlike their counterparts in natural language processing (NLP). Typically, ViT tokens are associated with rectangular image patches that lack specific…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Young Kyung Kim, J. Matías Di Martino, Guillermo Sapiro

Vision Transformers: From Semantic Segmentation to Dense Prediction

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representation at full receptive field per layer across all the image…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

Making Vision Transformers Efficient from A Token Sparsification View

The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

RegionViT: Regional-to-Local Attention for Vision Transformers

Vision transformer (ViT) has recently shown its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture from the natural…

Computer Vision and Pattern Recognition · Computer Science 2022-04-01 Chun-Fu Chen, Rameswar Panda, Quanfu Fan

Vision Transformer for Contrastive Clustering

Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Besides ViT, contrastive learning is another…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed…

Computer Vision and Pattern Recognition · Computer Science 2021-12-01 Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

Representation Separation for Semantic Segmentation with Vision Transformers

Vision transformers (ViTs) encoding an image as a sequence of patches bring new paradigms for semantic segmentation. We present an efficient framework of representation separation in local-patch level and global-region level for semantic…

Computer Vision and Pattern Recognition · Computer Science 2024-10-28 Yuanduo Hong, Huihui Pan, Weichao Sun, Xinghu Yu, Huijun Gao

Boosting vision transformers for image retrieval

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional…

Computer Vision and Pattern Recognition · Computer Science 2022-10-24 Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

The Vision Transformer (ViT) leverages the Transformer's encoder to capture global information by dividing images into patches and achieves superior performance across various computer vision tasks. However, the self-attention mechanism of…

Computer Vision and Pattern Recognition · Computer Science 2025-01-17 Tianxiao Zhang, Wenju Xu, Bo Luo, Guanghui Wang

Vision Transformers are Robust Learners

Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Sayak Paul, Pin-Yu Chen

Discrete Representations Strengthen Vision Transformer Robustness

Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, Irfan Essa

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing objects from subcategories is a very challenging task due to the inherently subtle inter-class differences. Most existing works mainly tackle this problem by reusing the…

Computer Vision and Pattern Recognition · Computer Science 2021-12-03 Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang

Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words

Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos

Vision Transformer with Super Token Sampling

Vision transformer has achieved impressive performance for many vision tasks. However, it may suffer from high redundancy in capturing local features for shallow layers. Local self-attention or early-stage convolutions are thus utilized,…

Computer Vision and Pattern Recognition · Computer Science 2024-01-26 Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, Tieniu Tan

Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry

The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Haoyu Yun, Hamid Krim

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

Vision transformer (ViT) has achieved competitive accuracy on a variety of computer vision applications, but its computational cost impedes the deployment on resource-limited mobile devices. We explore the sparsity in ViT and observe that…

Computer Vision and Pattern Recognition · Computer Science 2022-03-10 Zhuoran Song, Yihong Xu, Zhezhi He, Li Jiang, Naifeng Jing, Xiaoyao Liang

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization independent of the semantic content of an image. We propose a modular superpixel tokenization strategy which decouples tokenization and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Marius Aasan, Odd Kolbjørnsen, Anne Schistad Solberg, Adín Ramirez Rivera

Improving Vision Transformers for Incremental Learning

This paper proposes a working recipe of using Vision Transformer (ViT) in class incremental learning. Although this recipe only combines existing techniques, developing the combination is not trivial. Firstly, naive application of ViT to…

Computer Vision and Pattern Recognition · Computer Science 2022-04-19 Pei Yu, Yinpeng Chen, Ying Jin, Zicheng Liu

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei