Related papers: Learning Correlation Structures for Vision Transfo…

200 papers

Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers, resulting in the vanishing of low-level visual features. However, such features can be helpful to accurately represent and identify…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Anxhelo Diko, Danilo Avola, Marco Cascio, Luigi Cinque

Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks,…

Computer Vision and Pattern Recognition · Computer Science 2021-11-03 Manjin Kim, Heeseung Kwon, Chunyu Wang, Suha Kwak, Minsu Cho

Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two…

Computer Vision and Pattern Recognition · Computer Science 2022-12-27 Xu Ma, Huan Wang, Can Qin, Kunpeng Li, Xingchen Zhao, Jie Fu, Yun Fu

Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-02 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has shown to be…

Computer Vision and Pattern Recognition · Computer Science 2019-05-24 Siddhesh Khandelwal, Leonid Sigal

Attention mechanism has gained huge popularity due to its effectiveness in achieving high accuracy in different domains. But attention is opportunistic and is not justified by the content or usability of the content. Transformer like…

Computer Vision and Pattern Recognition · Computer Science 2020-06-26 Chiranjib Sur

Despite the impressive representation capacity of vision transformer models, current light-weight vision transformer models still suffer from inconsistent and incorrect dense predictions at local regions. We suspect that the power of their…

Computer Vision and Pattern Recognition · Computer Science 2021-12-22 Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe Lin, Alan Yuille

Attention operator has been widely used as a basic brick in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari

Vision Transformers (ViT) have shown their competitive advantages performance-wise compared to convolutional neural networks (CNNs) though they often come with high computational costs. To this end, previous methods explore different…

Computer Vision and Pattern Recognition · Computer Science 2023-03-27 Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti

Vision Transformers (ViTs) have achieved performance comparable or superior to that of Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Samy Jelassi, Michael E. Sander, Yuanzhi Li

Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng

In recent years, convolutional neural networks (CNNs) with channel-wise feature refining mechanisms have brought noticeable benefits to modelling channel dependencies. However, current attention paradigms fail to infer an optimal channel…

Computer Vision and Pattern Recognition · Computer Science 2024-05-14 Nick Nikzad, Yongsheng Gao, Jun Zhou

We propose a novel recurrent attentional structure to localize and recognize objects jointly. The network can learn to extract a sequence of local observations with detailed appearance and rough context, instead of sliding windows or…

Computer Vision and Pattern Recognition · Computer Science 2017-12-20 Jie Lyu, Zejian Yuan, Dapeng Chen

Convolutional architectures have proven extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision Transformers (ViTs) rely on…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, Levent Sagun

Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention…

Artificial Intelligence · Computer Science 2025-12-18 Sahil Rajesh Dhayalkar

Benefiting from the capability of building inter-dependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2020-11-09 Diganta Misra, Trikay Nalamada, Ajay Uppili Arasanipalai, Qibin Hou

The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA). The MSA enables global interactions at each layer of a ViT model, which is a contrasting feature against Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Nam Hyeon-Woo, Kim Yu-Ji, Byeongho Heo, Dongyoon Han, Seong Joon Oh, Tae-Hyun Oh

Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-27 Yehao Li, Ting Yao, Yingwei Pan, Tao Mei