Related papers: Sparse and Structured Visual Attention

Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

A key solution to visual question answering (VQA) exists in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the…

Computer Vision and Pattern Recognition · Computer Science 2018-12-04 Duy-Kien Nguyen, Takayuki Okatani

Knowing Where to Look? Analysis on Attention of Visual Question Answering System

Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions. Analyzing attention maps offers us a perspective to find out limitations of current VQA…

Computer Vision and Pattern Recognition · Computer Science 2018-10-10 Wei Li, Zehuan Yuan, Xiangzhong Fang, Changhu Wang

Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from…

Computer Vision and Pattern Recognition · Computer Science 2021-08-26 Moshiur R Farazi, Salman H Khan, Nick Barnes

Task-driven Visual Saliency and Attention-based Visual Question Answering

Visual question answering (VQA) has witnessed great progress since May, 2015 as a classic problem unifying visual and textual data into a system. Many enlightening VQA works explore deep into the image and question encodings and fusing…

Computer Vision and Pattern Recognition · Computer Science 2017-02-23 Yuetan Lin, Zhangyang Pang, Donghui Wang, Yueting Zhuang

Reciprocal Attention Fusion for Visual Question Answering

Existing attention mechanisms either attend to local image grid or object level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate to both object instances and their parts, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2021-08-30 Moshiur R Farazi, Salman H Khan

Multimodal Continuous Visual Attention Mechanisms

Visual attention mechanisms are a key component of neural network models for computer vision. By focusing on a discrete set of objects or image regions, these mechanisms identify the most relevant features and use them to build more…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 António Farinhas, André F. T. Martins, Pedro M. Q. Aguiar

Analysis of Visual Question Answering Algorithms with attention model

Visual question answering (VQA) uses image processing algorithms to process the image and natural language processing methods to understand and answer the question. VQA is helpful to a visually impaired person, can be used for the security…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Param Ahir, Hiteishi M. Diwanji

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of…

Computation and Language · Computer Science 2025-05-13 Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Long-Context Generalization with Sparse Attention

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for…

Computation and Language · Computer Science 2026-03-03 Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

A Regularized Framework for Sparse and Structured Neural Attention

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max…

Machine Learning · Statistics 2019-02-26 Vlad Niculae, Mathieu Blondel

Scalable-Softmax Is Superior for Attention

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to…

Computation and Language · Computer Science 2025-02-03 Ken M. Nakanishi

From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing question to selectively target different visual areas that are related to the answer. Existing visual attention models are generally planar,…

Computer Vision and Pattern Recognition · Computer Science 2022-06-07 Jingkuan Song, Pengpeng Zeng, Lianli Gao, Heng Tao Shen

An Improved Attention for Visual Question Answering

We consider the problem of Visual Question Answering (VQA). Given an image and a free-form, open-ended, question, expressed in natural language, the goal of VQA system is to provide accurate answer to this question with respect to the…

Computer Vision and Pattern Recognition · Computer Science 2021-06-07 Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini

Spatially Aware Multimodal Transformers for TextVQA

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches…

Computer Vision and Pattern Recognition · Computer Science 2020-12-24 Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal

Answer-checking in Context: A Multi-modal FullyAttention Network for Visual Question Answering

Visual Question Answering (VQA) is challenging due to the complex cross-modal relations. It has received extensive attention from the research community. From the human perspective, to answer a visual question, one needs to read the…

Computer Vision and Pattern Recognition · Computer Science 2020-10-20 Hantao Huang, Tao Han, Wei Han, Deep Yap, Cheng-Ming Chiang

Efficient Attention via Control Variates

Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon…

Machine Learning · Computer Science 2023-02-10 Lin Zheng, Jianbo Yuan, Chong Wang, Lingpeng Kong

Exploring Human-like Attention Supervision in Visual Question Answering

Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to…

Computer Vision and Pattern Recognition · Computer Science 2017-09-20 Tingting Qiao, Jianfeng Dong, Duanqing Xu

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation…

Machine Learning · Statistics 2016-06-20 Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra