Related papers: Learning Object Focused Attention

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque

Interpretable Vision Transformers in Image Classification via SVDA

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

You Only Need Less Attention at Each Stage in Vision Transformers

The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Vision transformer (ViT) expands the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attentions are then applied to the…

Machine Learning · Computer Science 2023-03-27 Yiran Li, Junpeng Wang, Xin Dai, Liang Wang, Chin-Chia Michael Yeh, Yan Zheng, Wei Zhang, Kwan-Liu Ma

OAMixer: Object-aware Mixing Layer for Vision Transformers

Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, alternating classic convolutional networks. While the initial patch-based models (ViTs) treated all patches…

Computer Vision and Pattern Recognition · Computer Science 2022-12-14 Hyunwoo Kang, Sangwoo Mo, Jinwoo Shin

Learning Diverse Features in Vision Transformers for Improved Generalization

Deep learning models often rely only on a small set of features even when there is a rich set of predictive signals in the training data. This makes models brittle and sensitive to distribution shifts. In this work, we first examine vision…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Armand Mihai Nicolicioiu, Andrei Liviu Nicolicioiu, Bogdan Alexe, Damien Teney

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved comparable or superior performance than Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Samy Jelassi, Michael E. Sander, Yuanzhi Li

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Vision Transformer (ViT) is one of the most widely used models in the computer vision field with its great performance on various tasks. In order to fully utilize the ViT-based architecture in various applications, proper visualization…

Computer Vision and Pattern Recognition · Computer Science 2024-02-08 Saebom Leem, Hyunseok Seo

Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches

The latest generation of transformer-based vision models has proven to be superior to Convolutional Neural Network (CNN)-based models across several vision tasks, largely attributed to their remarkable prowess in relation modeling.…

Computer Vision and Pattern Recognition · Computer Science 2023-12-29 Quazi Mishkatul Alam, Bilel Tarchoun, Ihsen Alouani, Nael Abu-Ghazaleh

On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations…

Machine Learning · Computer Science 2024-11-15 Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation

In recent years, increasingly large models have achieved outstanding performance across CV tasks. However, these models demand substantial computational resources and storage, and their growing complexity limits our understanding of how…

Machine Learning · Computer Science 2025-11-21 Carlos Boned Riera, David Romero Sanchez, Oriol Ramos Terrades

Unified Local and Global Attention Interaction Modeling for Vision Transformers

We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets. ViTs show strong capability for image understanding tasks such as object…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Tan Nguyen, Coy D. Heldermon, Corey Toler-Franklin

Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior

The aim of object-centric vision is to construct an explicit representation of the objects in a scene. This representation is obtained via a set of interchangeable modules called slots or object files that compete for local…

Computer Vision and Pattern Recognition · Computer Science 2023-06-06 Ayush Chakravarthy, Trang Nguyen, Anirudh Goyal, Yoshua Bengio, Michael C. Mozer

Preserving Locality in Vision Transformers for Class Incremental Learning

Learning new classes without forgetting is crucial for real-world applications for a classification model. Vision Transformers (ViT) recently achieve remarkable performance in Class Incremental Learning (CIL). Previous works mainly focus on…

Machine Learning · Computer Science 2023-04-17 Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan

Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields

Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks. Because partitioning patches eliminates the image structure, to reflect the order of patches, ViTs…

Computer Vision and Pattern Recognition · Computer Science 2023-05-09 Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim

Object-aware Feature Aggregation for Video Object Detection

We present an Object-aware Feature Aggregation (OFA) module for video object detection (VID). Our approach is motivated by the intriguing property that video-level object-aware knowledge can be employed as a powerful semantic prior to help…

Computer Vision and Pattern Recognition · Computer Science 2020-10-26 Qichuan Geng, Hong Zhang, Na Jiang, Xiaojuan Qi, Liangjun Zhang, Zhong Zhou

Vision Transformer with Deformable Attention

Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply…

Computer Vision and Pattern Recognition · Computer Science 2022-05-25 Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang

Vision Xformers: Efficient Attention for Image Classification

Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computations in order to compete with convolutional neural networks…

Computer Vision and Pattern Recognition · Computer Science 2021-10-04 Pranav Jeevan, Amit Sethi

Differentiable Soft-Masked Attention

Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an…

Computer Vision and Pattern Recognition · Computer Science 2022-08-08 Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs). One reason is that ViTs' attention measures global similarities and thus has a…

Computer Vision and Pattern Recognition · Computer Science 2024-07-26 Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin