Related papers: Multi-Modal Answer Validation for Knowledge-Based …

Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond visible content to answer questions about an image, which is challenging but indispensable to achieve general VQA. One limitation of existing FVQA solutions is…

Computer Vision and Pattern Recognition · Computer Science · 2020-11-05 · Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, Qi Wu

MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-19 · Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

Interpretable Visual Question Answering Referring to Outside Knowledge

We present a novel multimodal interpretable VQA model that can answer the question more accurately and generate diverse explanations. Although researchers have proposed several methods that can generate human-readable and fine-grained…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-09 · He Zhu, Ren Togo, Takahiro Ogawa, Miki Haseyama

Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable to achieve general VQA. One limitation of existing…

Artificial Intelligence · Computer Science · 2020-11-04 · Jing Yu, Zihao Zhu, Yujing Wang, Weifeng Zhang, Yue Hu, Jianlong Tan

Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base

Knowledge-based visual question answering (KVQA) task aims to answer questions that require additional external knowledge as well as an understanding of images and questions. Recent studies on KVQA inject an external knowledge in a…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-28 · Jinyeong Chae, Jihie Kim

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

Visual Question Answering (VQA) in its ideal form lets us study reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most VQA benchmarks to date are focused on questions…

Computer Vision and Pattern Recognition · Computer Science · 2019-09-05 · Kenneth Marino, Mohammad Rastegari, Ali Farhadi, Roozbeh Mottaghi

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering

Knowledge-based visual question answering requires the ability of associating external knowledge for open-ended cross-modal scene understanding. One limitation of existing solutions is that they capture relevant knowledge from text-only…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-18 · Yang Ding, Jing Yu, Bang Liu, Yue Hu, Mingxin Cui, Qi Wu

Knowledge Detection by Relevant Question and Image Attributes in Visual Question Answering

Visual question answering (VQA) is a multidisciplinary research problem that is pursued through practices of natural language processing and computer vision. Visual question answering automatically answers natural language questions according…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-01 · Param Ahir, Hiteishi Diwanji

Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most of the current algorithms are incapable of answering…

Computer Vision and Pattern Recognition · Computer Science · 2017-12-05 · Guohao Li, Hang Su, Wenwu Zhu

Combo of Thinking and Observing for Outside-Knowledge VQA

Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space which…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-12 · Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, Weiping Wang

Multimodal Rationales for Explainable Visual Question Answering

Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However,…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-11 · Kun Li, George Vosselman, Michael Ying Yang

A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Though significant it is, this paper discovers several leading factors impeding the advancement of current…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-01 · Yangyang Guo, Liqiang Nie, Yongkang Wong, Yibing Liu, Zhiyong Cheng, Mohan Kankanhalli

Visual Question Answering as Reading Comprehension

Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the…

Computer Vision and Pattern Recognition · Computer Science · 2018-11-30 · Hui Li, Peng Wang, Chunhua Shen, Anton van den Hengel

Multimodal Reranking for Knowledge-Intensive Visual Question Answering

Knowledge-intensive visual question answering requires models to effectively use external knowledge to help answer visual questions. A typical pipeline includes a knowledge retriever and an answer generator. However, a retriever that…

Computation and Language · Computer Science · 2024-07-18 · Haoyang Wen, Honglei Zhuang, Hamed Zamani, Alexander Hauptmann, Michael Bendersky

EKTVQA: Generalized use of External Knowledge to empower Scene Text in Text-VQA

The open-ended question answering task of Text-VQA often requires reading and reasoning about rarely seen or completely unseen scene-text content of an image. We address this zero-shot nature of the problem by proposing the generalized use…

Computer Vision and Pattern Recognition · Computer Science · 2022-07-18 · Arka Ujjal Dey, Ernest Valveny, Gaurav Harit

Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows…

Computer Vision and Pattern Recognition · Computer Science · 2016-04-15 · Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel

Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

We study the Knowledge-Based visual question-answering problem, for which given a question, the models need to ground it into the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize…

Artificial Intelligence · Computer Science · 2024-06-28 · Elham J. Barezi, Parisa Kordjamshidi

Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering

We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities…

Computation and Language · Computer Science · 2023-01-12 · Paul Lerner, Olivier Ferret, Camille Guinaudeau

Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering

Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is mostly used in evaluating knowledge-based VQA is OK-VQA, but it lacks a gold…

Computation and Language · Computer Science · 2021-09-10 · Man Luo, Yankai Zeng, Pratyay Banerjee, Chitta Baral

Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques and is gradually becoming a benchmark task for multimodal large language models (MLLMs). The goal of our survey is…

Computation and Language · Computer Science · 2024-11-27 · Jiayi Kuang, Jingyou Xie, Haohao Luo, Ronghao Li, Zhe Xu, Xianfeng Cheng, Yinghui Li, Xika Lin, Ying Shen