Related papers: Beyond Fixation: Dynamic Window Visual Transformer

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one of them that outperforms convolution-based architectures in terms of accuracy, while improving efficiency…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-27 · Jinkyu Koo, John Yang, Le An, Gwenaelle Cunha Sergio, Su Inn Park

SimViT: Exploring a Simple Vision Transformer with sliding windows

Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them intend to capture global relations of all tokens in an image or a window, which disrupts the inherent spatial and local…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-28 · Gang Li, Di Xu, Xing Cheng, Lingyu Si, Changwen Zheng

Dual Vision Transformer

Prior works have proposed several strategies to reduce the computational cost of self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures that each…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-13 · Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei

Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost

Transformers have astounding representational power but typically consume considerable computation which is quadratic with image resolution. The prevailing Swin transformer reduces computational costs through a local window strategy.…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-15 · Haolin Qin, Daquan Zhou, Tingfa Xu, Ziyang Bian, Jianan Li

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-12 · Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

Multi-Dimensional Hyena for Spatial Inductive Bias

In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-26 · Itamar Zimerman, Lior Wolf

DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion

Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architecture in computer vision. Unlike convolutional neural networks (CNNs), ViTs are capable of global information sharing. With the development of…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-25 · Zhenzhen Chu, Jiayu Chen, Cen Chen, Chengyu Wang, Ziheng Wu, Jun Huang, Weining Qian

VSA: Learning Varied-Size Window Attention in Vision Transformers

Attention within windows has been widely explored in vision transformers to balance the performance, computation complexity, and memory footprint. However, current models adopt a hand-crafted fixed-size window design, which restricts their…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-04 · Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao

CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision

Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-12 · Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh

Global Context Vision Transformers

We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, joint with standard local…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-07 · Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

BOAT: Bilateral Local Attention Vision Transformer

Vision Transformers achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-20 · Tan Yu, Gangming Zhao, Ping Li, Yizhou Yu

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourish of semantic segmentation convolutional neural networks (CNN) exploiting multi-scale contextual information. Motivated by that the…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-10 · Haotian Yan, Chuang Zhang, Ming Wu

Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViTs), which increase quadratically in the token number. We present a novel training paradigm that trains only one ViT model at a time, but is capable of providing…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-20 · Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

The formidable accomplishment of Transformers in natural language processing has motivated the researchers in the computer vision community to build Vision Transformers. Compared with the Convolution Neural Networks (CNN), a Vision…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-28 · Tan Yu, Ping Li

COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models

Attention-based vision models, such as Vision Transformer (ViT) and its variants, have shown promising performance in various computer vision tasks. However, these emerging architectures suffer from large model sizes and high computational…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-04 · Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, Bo Yuan

LaVin-DiT: Large Vision Diffusion Transformer

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-07 · Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu

DSwinIR: Rethinking Window-based Attention for Image Restoration

Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, particularly those using window-based self-attention, have become a dominant force. However, their performance…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-30 · Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

HSViT: Horizontally Scalable Vision Transformer

Due to its deficiency in prior knowledge (inductive bias), Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing layers and parameters in ViT models impede their applicability to…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-17 · Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton

ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer

The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-19 · Rui Yang, Hailong Ma, Jie Wu, Yansong Tang, Xuefeng Xiao, Min Zheng, Xiu Li

A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking

Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism,…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-06 · Lorenzo Papa, Paolo Russo, Irene Amerini, Luping Zhou