Related papers: Visual Set Program Synthesizer

Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing…

Artificial Intelligence · Computer Science · 2025-11-25 · Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first…

Computation and Language · Computer Science · 2024-10-14 · Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Object-based reasoning in VQA

Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural…

Computer Vision and Pattern Recognition · Computer Science · 2018-01-31 · Mikyas T. Desta, Larry Chen, Tomasz Kornuta

Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"

Visual reasoning tasks such as visual question answering (VQA) require an interplay of visual perception with reasoning about the question semantics grounded in perception. However, recent advances in this area are still primarily driven by…

Machine Learning · Computer Science · 2020-08-27 · Saeed Amizadeh, Hamid Palangi, Oleksandr Polozov, Yichen Huang, Kazuhito Koishida

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-08 · Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering, seeking to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages scene…

Computation and Language · Computer Science · 2019-07-12 · Drew A. Hudson, Christopher D. Manning

Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering

Many vision and language tasks require commonsense reasoning beyond data-driven image and natural language processing. Here we adopt Visual Question Answering (VQA) as an example task, where a system is expected to answer a question in…

Computer Vision and Pattern Recognition · Computer Science · 2018-03-26 · Somak Aditya, Yezhou Yang, Chitta Baral

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-16 · Aleksandar Stanić, Sergi Caelles, Michael Tschannen

VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form…

Artificial Intelligence · Computer Science · 2026-03-18 · Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices

One of the primary challenges faced by deep learning is the degree to which current methods exploit superficial statistics and dataset bias, rather than learning to generalise over the specific representations they have experienced. This is…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-30 · Damien Teney, Peng Wang, Jiewei Cao, Lingqiao Liu, Chunhua Shen, Anton van den Hengel

Probing Visual Language Priors in VLMs

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-15 · Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

Visuo-Linguistic Question Answering (VLQA) Challenge

Understanding images and text together is an important aspect of cognition and building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over language and vision domains separately, however…

Computer Vision and Pattern Recognition · Computer Science · 2020-11-19 · Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral

SHOP-VRB: A Visual Reasoning Benchmark for Object Perception

In this paper we present an approach and a benchmark for visual reasoning in robotics applications, in particular small object grasping and manipulation. The approach and benchmark are focused on inferring object properties from visual and…

Computer Vision and Pattern Recognition · Computer Science · 2020-04-07 · Michal Nazarczuk, Krystian Mikolajczyk

Object-Centric Diagnosis of Visual Reasoning

When answering questions about an image, it not only needs knowing what -- understanding the fine-grained contents (e.g., objects, relationships) in the image, but also telling why -- reasoning over grounding visual cues to derive the…

Computer Vision and Pattern Recognition · Computer Science · 2020-12-22 · Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

Inferring and Executing Programs for Visual Reasoning

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases…

Computer Vision and Pattern Recognition · Computer Science · 2017-05-11 · Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations

Recovering analytical solutions of physical fields from visual observations is a fundamental yet underexplored capability for AI-assisted scientific reasoning. We study visual-to-symbolic analytical solution inference (ViSA) for…

Artificial Intelligence · Computer Science · 2026-04-13 · Pengze Li, Jiaquan Zhang, Yunbo Long, Xinping Liu, Zhou wenjie, Encheng Su, Zihang Zeng, Jiaqi Liu, Jiyao Liu, Junchi Yu, Lihao Liu, Philip Torr, Shixiang Tang, Aoran Wang, Xi Chen

Visual Graph Question Answering with ASP and LLMs for Language Parsing

Visual Question Answering (VQA) is a challenging problem that requires to process multimodal input. Answer-Set Programming (ASP) has shown great potential in this regard to add interpretability and explainability to modular VQA…

Artificial Intelligence · Computer Science · 2025-02-14 · Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch

MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?

Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-01 · Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang, Yangfu Zhu, Zhenzhou Shao

Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data

The impressive advances and applications of large language and joint language-and-visual understanding models has led to an increased need for methods of probing their potential reasoning capabilities. However, the difficulty of gather…

Machine Learning · Computer Science · 2023-06-05 · Nathan Vaska, Victoria Helus

VGR: Visual Grounded Reasoning

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-04 · Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao