Related papers: M2Former: Multi-Scale Patch Selection for Fine-Gra…

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC…

Computer Vision and Pattern Recognition · Computer Science 2022-03-25 Zi-Chao Zhang, Zhen-Duo Chen, Yongxin Wang, Xin Luo, Xin-Shun Xu

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC), which aims at recognizing objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Most existing works mainly tackle this problem by reusing the…

Computer Vision and Pattern Recognition · Computer Science 2021-12-03 Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor of learning-based MVS, vanilla Feature Pyramid Networks (FPNs) suffer from discouraged feature representations for…

Computer Vision and Pattern Recognition · Computer Science 2022-12-19 Chenjie Cao, Xinlin Ren, Yanwei Fu

Salient Mask-Guided Vision Transformer for Fine-Grained Classification

Fine-grained visual classification (FGVC) is a challenging computer vision problem, where the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative…

Computer Vision and Pattern Recognition · Computer Science 2024-01-03 Dmitry Demidov, Muhammad Hamza Sharif, Aliakbar Abdurahimov, Hisham Cholakkal, Fahad Shahbaz Khan

Multimodal Fusion Transformer for Remote Sensing Image Classification

Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs in…

Computer Vision and Pattern Recognition · Computer Science 2023-06-21 Swalpa Kumar Roy, Ankur Deria, Danfeng Hong, Behnood Rasti, Antonio Plaza, Jocelyn Chanussot

MPViT: Multi-Path Vision Transformer for Dense Prediction

Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have…

Computer Vision and Pattern Recognition · Computer Science 2021-12-28 Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Learning subtle representation about object parts plays a vital role in fine-grained visual recognition (FGVR) field. The vision transformer (ViT) achieves promising results on computer vision due to its attention mechanism. Nonetheless,…

Computer Vision and Pattern Recognition · Computer Science 2021-10-12 Yuan Zhang, Jian Cao, Ling Zhang, Xiangcheng Liu, Zhiyi Wang, Feng Ling, Weiqian Chen

RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, which has been explored a lot by convolutional neural networks (CNNs) based approaches. The recently developed vision…

Computer Vision and Pattern Recognition · Computer Science 2021-07-20 Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Yuan He, Hui Xue

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via…

Computer Vision and Pattern Recognition · Computer Science 2022-03-02 Jun Wang, Xiaohan Yu, Yongsheng Gao

MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in…

Computer Vision and Pattern Recognition · Computer Science 2021-08-24 Chun-Fu Chen, Quanfu Fan, Rameswar Panda

Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min

Patch-wise Mixed-Precision Quantization of Vision Transformer

As emerging hardware begins to support mixed bit-width arithmetic computation, mixed-precision quantization is widely used to reduce the complexity of neural networks. However, Vision Transformers (ViTs) require complex self-attention…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Junrui Xiao, Zhikai Li, Lianwei Yang, Qingyi Gu

MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification

Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision…

Computer Vision and Pattern Recognition · Computer Science 2023-09-19 Junjie Zhu, Yiying Li, Chunping Qiu, Ke Yang, Naiyang Guan, Xiaodong Yi

MVT: Multi-view Vision Transformer for 3D Object Recognition

Inspired by the great success achieved by CNN in image recognition, view-based methods applied CNNs to model the projected views for 3D object understanding and achieved excellent performance. Nevertheless, multi-view CNN models cannot…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Shuo Chen, Tan Yu, Ping Li

MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi

Face Pyramid Vision Transformer

A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood

Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches

Fine-grained visual classification (FGVC) is much more challenging than traditional classification tasks due to the inherently subtle intra-class object variations. Recent works mainly tackle this problem by focusing on how to locate the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Ruoyi Du, Dongliang Chang, Ayan Kumar Bhunia, Jiyang Xie, Zhanyu Ma, Yi-Zhe Song, Jun Guo

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Vision Transformers (ViT) have made many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. Therefore, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji

Exploring Vision Transformers for Fine-grained Classification

Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most…

Computer Vision and Pattern Recognition · Computer Science 2021-07-01 Marcos V. Conde, Kerem Turgutlu