Related papers: Multiscale Vision Transformers

MMViT: Multiscale Multiview Vision Transformers

We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models. Our model encodes different views of the input signal and builds several channel-resolution…

Computer Vision and Pattern Recognition · Computer Science 2023-05-02 Yuchen Liu, Natasha Ong, Kaiyan Peng, Bo Xiong, Qifan Wang, Rui Hou, Madian Khabsa, Kaiyue Yang, David Liu, Donald S. Williamson, Hanchao Yu

Multiview Transformers for Video Recognition

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art,…

Computer Vision and Pattern Recognition · Computer Science 2022-06-01 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Albert Dominguez Mantes, Gioele La Manno, Martin Weigert

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without…

Computer Vision and Pattern Recognition · Computer Science 2022-12-02 Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques. The first is the…

Computer Vision and Pattern Recognition · Computer Science 2021-05-28 Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging

Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Aon Safdar, Mohamed Saadeldin

Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words

Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos

HSViT: Horizontally Scalable Vision Transformer

Due to its deficiency in prior knowledge (inductive bias), Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well. Moreover, the growing layers and parameters in ViT models impede their applicability to…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, Douglas Creighton

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

Class-agnostic Object Detection with Multi-modal Transformer

What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale…

Computer Vision and Pattern Recognition · Computer Science 2022-07-20 Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

ViViT: A Video Vision Transformer

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of…

Computer Vision and Pattern Recognition · Computer Science 2021-11-02 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

Scalable Vision Transformers with Hierarchical Pooling

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-19 Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai

VariViT: A Vision Transformer for Variable Image Sizes

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a…

Computer Vision and Pattern Recognition · Computer Science 2026-02-17 Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers recently demonstrate competitive performance in image classification tasks. To adopt ViT to…

Computer Vision and Pattern Recognition · Computer Science 2022-10-04 Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

MPViT: Multi-Path Vision Transformer for Dense Prediction

Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have…

Computer Vision and Pattern Recognition · Computer Science 2021-12-28 Youngwan Lee, Jonghee Kim, Jeff Willette, Sung Ju Hwang

LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT…

Computer Vision and Pattern Recognition · Computer Science 2024-09-06 Jeongsoo Kim, Jongho Nang, Junsuk Choe

MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in…

Computer Vision and Pattern Recognition · Computer Science 2021-08-24 Chun-Fu Chen, Quanfu Fan, Rameswar Panda

MVT: Multi-view Vision Transformer for 3D Object Recognition

Inspired by the great success achieved by CNN in image recognition, view-based methods applied CNNs to model the projected views for 3D object understanding and achieved excellent performance. Nevertheless, multi-view CNN models cannot…

Computer Vision and Pattern Recognition · Computer Science 2021-10-26 Shuo Chen, Tan Yu, Ping Li