Related papers: Multiple-Question Multiple-Answer Text-VQA

200 papers

Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the…

Computer Vision and Pattern Recognition · Computer Science 2020-03-26 Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how…

Computer Vision and Pattern Recognition · Computer Science 2017-09-26 Supriya Pandhre, Shagun Sodhani

The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often…

Computer Vision and Pattern Recognition · Computer Science 2025-04-14 Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Kalaiarasi Sonai Muthu Anbananthen

Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Kun Li, George Vosselman, Michael Ying Yang

A trending paradigm for multiple-choice question answering (MCQA) is using a text-to-text framework. By unifying data in different tasks into a single text-to-text format, it trains a generative encoder-decoder model which is both powerful…

Computation and Language · Computer Science 2022-05-03 Zixian Huang, Ao Wu, Jiaying Zhou, Yu Gu, Yue Zhao, Gong Cheng

Visual Question Answering (VQA) becomes one of the most active research problems in the medical imaging domain. A well-known VQA challenge is the intrinsic diversity between the image and text modalities, and in the medical VQA task, there…

Computer Vision and Pattern Recognition · Computer Science 2023-02-28 Yuan Zhou, Jing Mei, Yiqin Yu, Tanveer Syeda-Mahmood

Medical Visual Question Answering (VQA) is a multi-modal challenging task widely considered by research communities of the computer vision and natural language processing. Since most current medical VQA models focus on visual content,…

Computer Vision and Pattern Recognition · Computer Science 2021-07-08 Haiwei Pan, Shuning He, Kejia Zhang, Bo Qu, Chunling Chen, Kun Shi

Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple…

Computer Vision and Pattern Recognition · Computer Science 2023-04-05 Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. Instead, we attribute…

Computation and Language · Computer Science 2023-10-27 Yuxin Zuo, Bei Li, Chuanhao Lv, Tong Zheng, Tong Xiao, Jingbo Zhu

We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e. text, audio, and video), which forms the context for the query (question and answer). Our approach…

Computation and Language · Computer Science 2020-04-28 Abhishek Kumar, Trisha Mittal, Dinesh Manocha

When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been…

Computation and Language · Computer Science 2021-04-14 Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, Jonathan Berant

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose…

Computation and Language · Computer Science 2023-10-25 Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a…

Computation and Language · Computer Science 2026-04-22 Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta

Textbook Question Answering (TQA) is a complex multimodal task to infer answers given large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large number of uncommon terminologies and…

Multimedia · Computer Science 2021-12-07 Fangzhi Xu, Qika Lin, Jun Liu, Lingling Zhang, Tianzhe Zhao, Qi Chai, Yudai Pan

Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual…

Computer Vision and Pattern Recognition · Computer Science 2025-07-08 Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant…

Computation and Language · Computer Science 2025-05-20 Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly…

Computation and Language · Computer Science 2025-03-11 Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal

Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the…

Computer Vision and Pattern Recognition · Computer Science 2018-11-30 Hui Li, Peng Wang, Chunhua Shen, Anton van den Hengel

The ideal form of Visual Question Answering requires understanding, grounding and reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks are…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Kang Chen, Xiangqian Wu

Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, inefficiency with…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath