Related papers: Dynamic Grained Encoder for Vision Transformers

Fine-grained recognition involves the classification of images from subordinate macro-categories, and it is challenging due to small inter-class differences. To overcome this, most methods perform discriminative feature selection enabled by…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai

Transformer-Based Visual Segmentation: A Survey

Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy

Visual Grounding with Transformers

In this paper, we propose a transformer based approach for visual grounding. Unlike previous proposal-and-rank frameworks that rely heavily on pretrained object detectors or proposal-free frameworks that upgrade an off-the-shelf one-stage…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Ye Du, Zehua Fu, Qingjie Liu, Yunhong Wang

TCFormer: Visual Recognition via Token Clustering Transformer

Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David Carlyn, Samuel Stevens, Kaiya L. Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Native Segmentation Vision Transformers

Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer, that dynamically assigns tokens to a…

Computer Vision and Pattern Recognition · Computer Science 2025-05-23 Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Wei Chen, Long Chen, Yu Wu

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science 2022-04-07 Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

A Comprehensive Study of Vision Transformers in Image Classification Tasks

Image Classification is a fundamental task in the field of computer vision that frequently serves as a benchmark for gauging advancements in Computer Vision. Over the past few years, significant progress has been made in image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-06 Mahmoud Khalil, Ahmad Khalil, Alioune Ngom

Reversible Vision Transformers

We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures…

Computer Vision and Pattern Recognition · Computer Science 2023-02-10 Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

Vision Transformers for Dense Prediction

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into…

Computer Vision and Pattern Recognition · Computer Science 2021-03-26 René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid…

Computer Vision and Pattern Recognition · Computer Science 2022-06-28 Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson

Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, fine-grained action recognition models require good temporal reasoning and…

Computer Vision and Pattern Recognition · Computer Science 2022-08-04 Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim

A Survey of Visual Transformers

Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He

Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min

EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu

Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

Training Vision Transformers for Image Retrieval

Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. We here extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers…

Computer Vision and Pattern Recognition · Computer Science 2021-02-11 Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou