Related papers: Curriculum Learning for Compositional Visual Reaso…

Multimodal Representations for Teacher-Guided Compositional Visual Reasoning

Neural Module Networks (NMN) are a compelling method for visual question answering, enabling the translation of a question into a program consisting of a series of reasoning sub-tasks that are sequentially executed on the image to produce…

Computation and Language · Computer Science 2023-10-25 Wafa Aissa, Marin Ferecatu, Michel Crucianu

CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering

Visual Question Answering (VQA) is a multi-discipline research task. To produce the right answer, it requires an understanding of the visual content of images, the natural language questions, as well as commonsense reasoning over the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Yao Zhang, Haokun Chen, Ahmed Frikha, Yezi Yang, Denis Krompass, Gengyuan Zhang, Jindong Gu, Volker Tresp

Learning Visual Knowledge Memory Networks for Visual Question Answering

Visual question answering (VQA) requires joint comprehension of images and natural language questions, where many questions can't be directly or clearly answered from visual content but require reasoning from structured human knowledge with…

Computer Vision and Pattern Recognition · Computer Science 2018-06-14 Zhou Su, Chen Zhu, Yinpeng Dong, Dongqi Cai, Yurong Chen, Jianguo Li

Show Why the Answer is Correct! Towards Explainable AI using Compositional Temporal Attention

Visual Question Answering (VQA) models have achieved significant success in recent times. Despite the success of VQA models, they are mostly black-box models providing no reasoning about the predicted answer, thus raising questions for…

Computer Vision and Pattern Recognition · Computer Science 2021-05-18 Nihar Bendre, Kevin Desai, Peyman Najafirad

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering

Visual Question Answering (VQA) has emerged as one of the most challenging tasks in artificial intelligence due to its multi-modal nature. However, most existing VQA methods are incapable of handling Knowledge-based Visual Question…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Chengxiang Yin, Zhengping Che, Kun Wu, Zhiyuan Xu, Jian Tang

Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques and is gradually becoming a benchmark test task in multimodal large language models (MLLMs). The goal of our survey is…

Computation and Language · Computer Science 2024-11-27 Jiayi Kuang, Jingyou Xie, Haohao Luo, Ronghao Li, Zhe Xu, Xianfeng Cheng, Yinghui Li, Xika Lin, Ying Shen

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. However, recent advances in this area are still primarily driven by…

Machine Learning · Computer Science 2020-08-27 Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Having revolutionized natural language processing (NLP) applications, large language models (LLMs) are expanding into the realm of multimodal inputs. Owing to their ability to interpret images, multimodal LLMs (MLLMs) have been primarily…

Computer Vision and Pattern Recognition · Computer Science 2024-02-14 Jusung Lee, Sungguk Cha, Younghyun Lee, Cheoljong Yang

From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering

In order to achieve a general visual question answering (VQA) system, it is essential to learn to answer deeper questions that require compositional reasoning on the image and external knowledge. Meanwhile, the reasoning process should be…

Computer Vision and Pattern Recognition · Computer Science 2022-06-28 Zihao Zhu

Improving Numerical Reasoning Skills in the Modular Approach for Complex Question Answering on Text

Numerical reasoning skills are essential for complex question answering (CQA) over text. It requires operations including counting, comparison, addition and subtraction. A successful approach to CQA on text, Neural Module Networks (NMNs),…

Computation and Language · Computer Science 2021-09-07 Xiao-Yu Guo, Yuan-Fang Li, Gholamreza Haffari

Conformal Cross-Modal Active Learning

Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Tobias Glück, Anke Schmeink, Andreas Kugi

How Modular Should Neural Module Networks Be for Systematic Generalization?

Neural Module Networks (NMNs) aim at Visual Question Answering (VQA) via composition of modules that tackle a sub-task. NMNs are a promising strategy to achieve systematic generalization, i.e., overcoming biasing factors in the training…

Machine Learning · Computer Science 2022-01-19 Vanessa D'Amario, Tomotake Sasaki, Xavier Boix

Visual Question Reasoning on General Dependency Tree

The collaborative reasoning for understanding each image-question pair is very critical but under-explored for an interpretable Visual Question Answering (VQA) system. Although very recent works also tried the explicit compositional…

Computer Vision and Pattern Recognition · Computer Science 2018-04-03 Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, Liang Lin

Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through…

Computer Vision and Pattern Recognition · Computer Science 2024-08-13 Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng

LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery

Visual question answering (VQA) is crucial for promoting surgical education. In practice, the needs of trainees are constantly evolving, such as learning more surgical types, adapting to different robots, and learning new surgical…

Information Retrieval · Computer Science 2024-10-24 Yuyang Du, Kexin Chen, Yue Zhan, Chang Han Low, Tao You, Mobarakol Islam, Ziyu Guo, Yueming Jin, Guangyong Chen, Pheng-Ann Heng

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performances in the general Visual Question Answering (VQA). However, these models struggle with VQA questions that…

Computation and Language · Computer Science 2024-11-06 Shuo Yang, Siwen Luo, Soyeon Caren Han

Selectively Answering Visual Questions

Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as helping the blind or visually impaired…

Computation and Language · Computer Science 2024-06-04 Julian Martin Eisenschlos, Hernán Maina, Guido Ivetta, Luciana Benotti

Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets

Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning…

Artificial Intelligence · Computer Science 2024-10-15 Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch

VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks

Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, very limited work on NMN has been studied in the video-grounded dialogue tasks. These tasks…

Computer Vision and Pattern Recognition · Computer Science 2022-06-14 Hung Le, Nancy F. Chen, Steven C. H. Hoi

OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning

A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of…

Artificial Intelligence · Computer Science 2023-10-31 Rim Assouel, Pau Rodriguez, Perouz Taslakian, David Vazquez, Yoshua Bengio