Related papers: Recursive Visual Programming

200 papers

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models…

Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) How to answer visually-grounded questions, which is the core…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Yulei Niu, Hanwang Zhang, Manli Zhang, Jianhong Zhang, Zhiwu Lu, Ji-Rong Wen

This paper revisits visual representation in knowledge-based visual question answering (VQA) and demonstrates that using regional information in a better way can significantly improve the performance. While visual representation is…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, Lu Yuan

Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low-code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms…

Artificial Intelligence · Computer Science 2025-10-09 Jiexi Xu, Jiaqi Liu, Lanruo Wang, Su Liu

Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural…

Computer Vision and Pattern Recognition · Computer Science 2018-01-31 Mikyas T. Desta, Larry Chen, Tomasz Kornuta

Visual question answering (VQA) is a challenging multi-modal task that requires not only the semantic understanding of both images and questions, but also the sound perception of a step-by-step reasoning process that would lead to the…

Computer Vision and Pattern Recognition · Computer Science 2021-04-06 Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon

Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy

Visual Parameter Space Analysis (VPSA) enables domain scientists to explore input-output relationships of computational models. Existing VPSA applications often feature multi-view visualizations designed by visualization experts for a…

Human-Computer Interaction · Computer Science 2024-09-12 Manfred Klaffenboeck, Michael Gleicher, Johannes Sorger, Michael Wimmer, Torsten Möller

Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how…

Computer Vision and Pattern Recognition · Computer Science 2017-09-26 Supriya Pandhre, Shagun Sodhani

Answering open-ended questions is an essential capability for any intelligent agent. One of the most interesting recent open-ended question answering challenges is Visual Question Answering (VQA) which attempts to evaluate a system's visual…

Computation and Language · Computer Science 2016-10-25 Omid Bakhshandeh, Trung Bui, Zhe Lin, Walter Chang

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and…

Computer Vision and Pattern Recognition · Computer Science 2023-03-15 Dídac Surís, Sachit Menon, Carl Vondrick

Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input…

Machine Learning · Computer Science 2025-06-03 Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy…

Computer Vision and Pattern Recognition · Computer Science 2023-10-27 Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu

Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang

Raven's Progressive Matrices (RPMs) is an established benchmark to examine the ability to perform high-level abstract visual reasoning (AVR). Despite the current success of algorithms that solve this task, humans can generalize beyond a…

Artificial Intelligence · Computer Science 2025-04-01 Kalliopi Basioti, Pritish Sahu, Qingze Tony Liu, Zihao Xu, Hao Wang, Vladimir Pavlovic

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios,…

Computation and Language · Computer Science 2016-10-28 Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh

In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are…

Computer Vision and Pattern Recognition · Computer Science 2017-09-15 Kushal Kafle, Christopher Kanan

Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, Tsung-Yi Ho

Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios with visually…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Jianing An, Luyang Jiang, Jie Luo, Wenjun Wu, Lei Huang

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges…

Computation and Language · Computer Science 2026-05-15 Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong