Related papers: Accelerating Vision Transformer Training via a Pat…

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Chaoya Jiang, Haiyang Xu, Chenliang Li, Miang Yan, Wei Ye, Shikun Zhang, Bin Bi, Songfang Huang

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-14 · Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songhang Huang, Fei Huang

Accelerating Vision Transformers with Adaptive Patch Sizes

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-24 · Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani

Effective Vision Transformer Training: A Data-Centric Perspective

Vision Transformers (ViTs) have shown promising performance compared with Convolutional Neural Networks (CNNs), but the training of ViTs is much harder than CNNs. In this paper, we define several metrics, including Dynamic Data Proportion…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-30 · Benjia Zhou, Pichao Wang, Jun Wan, Yanyan Liang, Fan Wang

Vision Transformer for Small-Size Datasets

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-28 · Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song

Vision Transformer with Progressive Sampling

Transformers with powerful global relation modeling abilities have been introduced to fundamental computer vision tasks recently. As a typical example, the Vision Transformer (ViT) directly applies a pure transformer architecture on image…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-05 · Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, Dahua Lin

Patch Slimming for Efficient Vision Transformers

This paper studies the efficiency problem for visual transformers by excavating redundant calculation in given networks. The recent transformer architecture has demonstrated its effectiveness for achieving excellent performance on a series…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-05 · Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, Dacheng Tao

EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training

Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance video quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-22 · Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-27 · Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang

Efficient Vision Transformer for Human Pose Estimation via Patch Selection

While Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance. However, the quadratic…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-23 · Kaleab A. Kinfu, Rene Vidal

Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Massoud Dehghan, Ramona Woitek, Amirreza Mahbod

Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks showing impressive performances when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-14 · Guglielmo Camporese, Elena Izzo, Lamberto Ballan

FlexiViT: One Model for All Patch Sizes

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-27 · Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-02 · Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-29 · Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA.…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-15 · Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-26 · Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved performance comparable or superior to that of Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-18 · Samy Jelassi, Michael E. Sander, Yuanzhi Li

Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-20 · Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

Exploiting Spatial Sparsity for Event Cameras with Visual Transformers

Event cameras report local changes of brightness through an asynchronous stream of output events. Events are spatially sparse at pixel locations with little brightness variation. We propose using a visual transformer (ViT) architecture to…

Computer Vision and Pattern Recognition · Computer Science · 2022-02-11 · Zuowen Wang, Yuhuang Hu, Shih-Chii Liu