Related papers: Learning Correlation Structures for Vision Transfo…

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque

Relational Self-Attention: What's Missing in Attention for Video Understanding

Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks,…

Computer Vision and Pattern Recognition · Computer Science 2021-11-03 Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

A Close Look at Spatial Modeling: From Attention to Convolution

Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two…

Computer Vision and Pattern Recognition · Computer Science 2022-12-27 Xu Ma, Huan Wang, Can Qin, Kunpeng Li, Xingchen Zhao, Jie Fu, Yun Fu

Interpretable Vision Transformers in Image Classification via SVDA

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

Focal Self-attention for Local-Global Interactions in Vision Transformers

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-02 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

AttentionRNN: A Structured Spatial Attention Mechanism

Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has shown to be…

Computer Vision and Pattern Recognition · Computer Science 2019-05-24 Siddhesh Khandelwal, Leonid Sigal

Self-Segregating and Coordinated-Segregating Transformer for Focused Deep Multi-Modular Network for Visual Question Answering

Attention mechanism has gained huge popularity due to its effectiveness in achieving high accuracy in different domains. But attention is opportunistic and is not justified by the content or usability of the content. Transformer like…

Computer Vision and Pattern Recognition · Computer Science 2020-06-26 Chiranjib Sur

Lite Vision Transformer with Enhanced Self-Attention

Despite the impressive representation capacity of vision transformer models, current light-weight vision transformer models still suffer from inconsistent and incorrect dense predictions at local regions. We suspect that the power of their…

Computer Vision and Pattern Recognition · Computer Science 2021-12-22 Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe Lin, Alan Yuille

Lightweight Structure-Aware Attention for Visual Understanding

Attention operator has been widely used as a basic brick in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

Vision Transformers (ViT) have shown their competitive advantages performance-wise compared to convolutional neural networks (CNNs) though they often come with high computational costs. To this end, previous methods explore different…

Computer Vision and Pattern Recognition · Computer Science 2023-03-27 Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti

Vision Transformers provably learn spatial structure

Vision Transformers (ViTs) have achieved comparable or superior performance than Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Samy Jelassi, Michael E. Sander, Yuanzhi Li

DeepViT: Towards Deeper Vision Transformer

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

CSA-Net: Channel-wise Spatially Autocorrelated Attention Networks

In recent years, convolutional neural networks (CNNs) with channel-wise feature refining mechanisms have brought noticeable benefits to modelling channel dependencies. However, current attention paradigms fail to infer an optimal channel…

Computer Vision and Pattern Recognition · Computer Science 2024-05-14 Nick Nikzad, Yongsheng Gao, Jun Zhou

Learning Fixation Point Strategy for Object Detection and Classification

We propose a novel recurrent attentional structure to localize and recognize objects jointly. The network can learn to extract a sequence of local observations with detailed appearance and rough context, instead of sliding windows or…

Computer Vision and Pattern Recognition · Computer Science 2017-12-20 Jie Lyu, Zejian Yuan, Dapeng Chen

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

Advancing Vision Transformers with Group-Mix Attention

Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention…

Artificial Intelligence · Computer Science 2025-12-18 Sahil Rajesh Dhayalkar

Rotate to Attend: Convolutional Triplet Attention Module

Benefiting from the capability of building inter-dependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2020-11-09 Diganta Misra, Trikay Nalamada, Ajay Uppili Arasanipalai, Qibin Hou

Scratching Visual Transformer's Back with Uniform Attention

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA). The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Nam Hyeon-Woo, Kim Yu-Ji, Byeongho Heo, Dongyoon Han, Seong Joon Oh, Tae-Hyun Oh

Contextual Transformer Networks for Visual Recognition

Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-27 Yehao Li, Ting Yao, Yingwei Pan, Tao Mei