Related papers: Multiple-Question Multiple-Answer Text-VQA

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the…

Computer Vision and Pattern Recognition · Computer Science 2020-03-26 Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Survey of Recent Advances in Visual Question Answering

Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how…

Computer Vision and Pattern Recognition · Computer Science 2017-09-26 Supriya Pandhre, Shagun Sodhani

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often…

Computer Vision and Pattern Recognition · Computer Science 2025-04-14 Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen

Multimodal Rationales for Explainable Visual Question Answering

Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Kun Li, George Vosselman, Michael Ying Yang

Clues Before Answers: Generation-Enhanced Multiple-Choice QA

A trending paradigm for multiple-choice question answering (MCQA) is using a text-to-text framework. By unifying data in different tasks into a single text-to-text format, it trains a generative encoder-decoder model which is both powerful…

Computation and Language · Computer Science 2022-05-03 Zixian Huang, Ao Wu, Jiaying Zhou, Yu Gu, Yue Zhao, Gong Cheng

Medical visual question answering using joint self-supervised learning

Visual Question Answering (VQA) becomes one of the most active research problems in the medical imaging domain. A well-known VQA challenge is the intrinsic diversity between the image and text modalities, and in the medical VQA task, there…

Computer Vision and Pattern Recognition · Computer Science 2023-02-28 Yuan Zhou, Jing Mei, Yiqin Yu, Tanveer Syeda-Mahmood

MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering

Medical Visual Question Answering (VQA) is a multi-modal challenging task widely considered by research communities of the computer vision and natural language processing. Since most current medical VQA models focus on visual content,…

Computer Vision and Pattern Recognition · Computer Science 2021-07-08 Haiwei Pan, Shuning He, Kejia Zhang, Bo Qu, Chunling Chen, Kun Shi

Hierarchical multimodal transformers for Multi-Page DocVQA

Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple…

Computer Vision and Pattern Recognition · Computer Science 2023-04-05 Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. Instead, we attribute…

Computation and Language · Computer Science 2023-10-27 Yuxin Zuo, Bei Li, Chuanhao Lv, Tong Zheng, Tong Xiao, Jingbo Zhu

MCQA: Multimodal Co-attention Based Network for Question Answering

We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e. text, audio, and video), which forms the context for the query (question and answer). Our approach…

Computation and Language · Computer Science 2020-04-28 Abhishek Kumar, Trisha Mittal, Dinesh Manocha

MultiModalQA: Complex Question Answering over Text, Tables and Images

When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been…

Computation and Language · Computer Science 2021-04-14 Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, Jonathan Berant

MaXM: Towards Multilingual Visual Question Answering

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose…

Computation and Language · Computer Science 2023-10-25 Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut

Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a…

Computation and Language · Computer Science 2026-04-22 Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta

MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

Textbook Question Answering (TQA) is a complex multimodal task to infer answers given large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large number of uncommon terminologies and…

Multimedia · Computer Science 2021-12-07 Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai Pan

Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant…

Computation and Language · Computer Science 2025-05-20 Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly…

Computation and Language · Computer Science 2025-03-11 Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal

Visual Question Answering as Reading Comprehension

Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the…

Computer Vision and Pattern Recognition · Computer Science 2018-11-30 Hui Li, Peng Wang, Chunhua Shen, Anton van den Hengel

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

The ideal form of Visual Question Answering requires understanding, grounding and reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks are…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Kang Chen, Xiangqian Wu

MGA-VQA: Secure and Interpretable Graph-Augmented Visual Question Answering with Memory-Guided Protection Against Unauthorized Knowledge Use

Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, inefficiency with…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath