Related papers: Efficient Self-supervised Vision Transformers for …

ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages

Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. However, ViTs face challenges such as high computational costs due to the quadratic scaling of self-attention and…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Zhoujie Qian

Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning

In this paper, we present an innovative approach to self-supervised learning for Vision Transformers (ViTs), integrating local masked image modeling with progressive layer freezing. This method focuses on enhancing the efficiency and speed…

Computer Vision and Pattern Recognition · Computer Science 2023-12-06 Utku Mert Topcuoglu, Erdem Akagündüz

Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

HSViT: Horizontally Scalable Vision Transformer

Due to its deficiency in prior knowledge (inductive bias), Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing layers and parameters in ViT models impede their applicability to…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Learning efficient and expressive visual representation has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

Efficient Partitioning Vision Transformer on Edge Devices for Distributed Inference

Deep learning models are increasingly utilized on resource-constrained edge devices for real-time data analytics. Recently, Vision Transformers and their variants have shown exceptional performance in various computer vision tasks. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Xiang Liu, Yijun Song, Xia Li, Yifei Sun, Huiying Lan, Zemin Liu, Linshan Jiang, Jialin Li

Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking

Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-06 Lorenzo Papa, Paolo Russo, Irene Amerini, Luping Zhou

Searching for Efficient Multi-Stage Vision Transformers

Vision Transformer (ViT) demonstrates that Transformer for natural language processing can be applied to computer vision tasks and result in comparable performance to convolutional neural networks (CNN), which have been studied and adopted…

Computer Vision and Pattern Recognition · Computer Science 2021-09-03 Yi-Lun Liao, Sertac Karaman, Vivienne Sze

EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients

Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. ViTs training demands higher computational resources due to the lack of 2D inductive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Meihan Wu, Tao Chang, Cui Miao, Jie Zhou, Chun Li, Xiangyu Xu, Ming Li, Xiaodong Wang

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

We design a family of image classification architectures that optimize the trade-off between accuracy and efficiency in a high-speed regime. Our work exploits recent findings in attention-based architectures, which are competitive on highly…

Computer Vision and Pattern Recognition · Computer Science 2021-05-07 Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze

Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers

Fine-grained classification is a challenging task that involves identifying subtle differences between objects within the same category. This task is particularly challenging in scenarios where data is scarce. Visual transformers (ViT) have…

Computer Vision and Pattern Recognition · Computer Science 2023-05-18 Manuel Lagunas, Brayan Impata, Victor Martinez, Virginia Fernandez, Christos Georgakis, Sofia Braun, Felipe Bertrand

Proximal Vision Transformer: Enhancing Feature Representation through Two-Stage Manifold Geometry

The Vision Transformer (ViT) architecture has become widely recognized in computer vision, leveraging its self-attention mechanism to achieve remarkable success across various tasks. Despite its strengths, ViT's optimization remains…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Haoyu Yun, Hamid Krim

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations…

Machine Learning · Computer Science 2024-11-15 Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

Visual Representation Learning with Self-Supervised Attention for Low-Label High-data Regime

Self-supervision has shown outstanding results for natural language processing, and more recently, for image recognition. Simultaneously, vision transformers and their variants have emerged as a promising and scalable alternative to…

Computer Vision and Pattern Recognition · Computer Science 2022-02-01 Prarthana Bhattacharyya, Chenge Li, Xiaonan Zhao, István Fehérvári, Jason Sun

EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires…

Computer Vision and Pattern Recognition · Computer Science 2025-07-28 Chen Zhu, Wangbo Zhao, Huiwen Zhang, Samir Khaki, Yuhao Zhou, Weidong Tang, Shuo Wang, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Kai Wang, Dawei Yang

Dual Vision Transformer

Prior works have proposed several strategies to reduce the computational cost of self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures that each…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei