Related papers: Pyramid Coder: Hierarchical Code Generator for Com…

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges…

Computation and Language · Computer Science 2026-05-15 Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

Modular Visual Question Answering via Code Generation

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models…

Computation and Language · Computer Science 2023-06-09 Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

Survey of Recent Advances in Visual Question Answering

Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how…

Computer Vision and Pattern Recognition · Computer Science 2017-09-26 Supriya Pandhre, Shagun Sodhani

Visual question answering: from early developments to recent advances -- a survey

Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, Hakim Hacid

Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts

Visual question answering (VQA) is the task of answering questions about an image. The task assumes an understanding of both the image and the question to provide a natural language answer. VQA has gained popularity in recent years due to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-01 Deepanway Ghosal, Navonil Majumder, Roy Ka-Wei Lee, Rada Mihalcea, Soujanya Poria

Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions

Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case…

Computer Vision and Pattern Recognition · Computer Science 2021-06-18 Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian

An Analysis of Visual Question Answering Algorithms

In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are…

Computer Vision and Pattern Recognition · Computer Science 2017-09-15 Kushal Kafle, Christopher Kanan

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

In this paper, we establish a benchmark for table visual question answering, referred to as the TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that…

Computer Vision and Pattern Recognition · Computer Science 2024-05-01 Yoonsik Kim, Moonbin Yim, Ka Yeon Song

Visual Question Answering based on Formal Logic

Visual question answering (VQA) has been gaining a lot of traction in the machine learning community in recent years due to the challenges posed in understanding information coming from multiple modalities (i.e., images, language). In…

Computer Vision and Pattern Recognition · Computer Science 2021-11-11 Muralikrishnna G. Sethuraman, Ali Payani, Faramarz Fekri, J. Clayton Kerce

Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Having revolutionized natural language processing (NLP) applications, large language models (LLMs) are expanding into the realm of multimodal inputs. Owing to their ability to interpret images, multimodal LLMs (MLLMs) have been primarily…

Computer Vision and Pattern Recognition · Computer Science 2024-02-14 Jusung Lee, Sungguk Cha, Younghyun Lee, Cheoljong Yang

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

Recently, to comprehensively improve Vision Language Models (VLMs) for Visual Question Answering (VQA), several methods have been proposed to further reinforce the inference capabilities of VLMs to independently tackle VQA tasks rather than…

Computer Vision and Pattern Recognition · Computer Science 2025-02-17 Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Xiao Wang, Keze Wang, Liang Lin

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Zero-shot Visual Question Answering (VQA) is a prominent vision-language task that examines both the visual and textual understanding capability of systems in the absence of training data. Recently, by converting the images into captions,…

Computer Vision and Pattern Recognition · Computer Science 2023-11-16 Yunshi Lan, Xiang Li, Xin Liu, Yang Li, Wei Qin, Weining Qian

Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey

Visual Question Answering (VQA) is a challenging task that combines natural language processing and computer vision techniques and has gradually become a benchmark test task for multimodal large language models (MLLMs). The goal of our survey is…

Computation and Language · Computer Science 2024-11-27 Jiayi Kuang, Jingyou Xie, Haohao Luo, Ronghao Li, Zhe Xu, Xianfeng Cheng, Yinghui Li, Xika Lin, Ying Shen

MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address…

Computation and Language · Computer Science 2025-06-12 Shuo Yang, Siwen Luo, Soyeon Caren Han, Eduard Hovy

Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question…

Computer Vision and Pattern Recognition · Computer Science 2026-05-06 Quanxing Xu, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin

AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering

Visual question answering aims to provide responses to natural language questions given visual input. Recently, visual programmatic models (VPMs), which generate executable programs to answer questions through large language models (LLMs),…

Artificial Intelligence · Computer Science 2024-07-30 Mahiro Ukai, Shuhei Kurita, Atsushi Hashimoto, Yoshitaka Ushiku, Nakamasa Inoue

Visual Question Answering: A Survey of Methods and Datasets

Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires…

Computer Vision and Pattern Recognition · Computer Science 2016-07-21 Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages scene…

Computation and Language · Computer Science 2019-07-12 Drew A. Hudson, Christopher D. Manning

VQA-Levels: A Hierarchical Approach for Classifying Questions in VQA

Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing and computer vision for analysing the relevant aspects of the image for answering the question asked. Several benchmark…

Computer Vision and Pattern Recognition · Computer Science 2025-02-06 Madhuri Latha Madaka, Chakravarthy Bhagvati

Visuo-Linguistic Question Answering (VLQA) Challenge

Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over language and vision domains separately; however…

Computer Vision and Pattern Recognition · Computer Science 2020-11-19 Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral