Related papers: Benchmark Visual Question Answer Models by using F…

Inferring and Executing Programs for Visual Reasoning

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases…

Computer Vision and Pattern Recognition · Computer Science 2017-05-11 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Runtao Liu, Chenxi Liu, Yutong Bai, Alan Yuille

Measuring CLEVRness: Blackbox testing of Visual Reasoning Models

How can we measure the reasoning capabilities of intelligence systems? Visual question answering provides a convenient framework for testing the model's abilities by interrogating the model through questions about the scene. However,…

Machine Learning · Computer Science 2022-03-01 Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

Learning Visual Reasoning Without Strong Priors

Achieving artificial visual reasoning - the ability to answer image-related questions which require a multi-step, high-level process - is an important step towards artificial general intelligence. This multi-modal task requires learning a…

Computer Vision and Pattern Recognition · Computer Science 2017-12-20 Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, Aaron Courville

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help,…

Computer Vision and Pattern Recognition · Computer Science 2016-12-22 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Learning to reason over visual objects

A core component of human intelligence is the ability to identify abstract patterns inherent in complex, high-dimensional perceptual data, as exemplified by visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated by the…

Computer Vision and Pattern Recognition · Computer Science 2023-10-30 Shanka Subhra Mondal, Taylor Webb, Jonathan D. Cohen

Learning to Agree on Vision Attention for Visual Commonsense Reasoning

Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction…

Computer Vision and Pattern Recognition · Computer Science 2023-02-21 Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, Mohan Kankanhalli

V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices

One of the primary challenges faced by deep learning is the degree to which current methods exploit superficial statistics and dataset bias, rather than learning to generalise over the specific representations they have experienced. This is…

Computer Vision and Pattern Recognition · Computer Science 2019-07-30 Damien Teney, Peng Wang, Jiewei Cao, Lingqiao Liu, Chunhua Shen, Anton van den Hengel

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this…

Machine Learning · Computer Science 2025-09-11 Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions…

Computer Vision and Pattern Recognition · Computer Science 2021-04-14 Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang, Chitta Baral

Visual Question Answering From Another Perspective: CLEVR Mental Rotation Tests

Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging…

Machine Learning · Statistics 2022-12-06 Christopher Beckham, Martin Weiss, Florian Golemo, Sina Honari, Derek Nowrouzezahrai, Christopher Pal

The meaning of "most" for visual question answering models

The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts.…

Computer Vision and Pattern Recognition · Computer Science 2019-06-05 Alexander Kuhnle, Ann Copestake

Attention Mechanism based Cognition-level Scene Understanding

Given a question-image input, the Visual Commonsense Reasoning (VCR) model can predict an answer with the corresponding rationale, which requires inference ability from the real world. The VCR task, which calls for exploiting the…

Computer Vision and Pattern Recognition · Computer Science 2025-03-10 Xuejiao Tang, Wenbin Zhang

Towards Visually Explaining Similarity Models

We consider the problem of visually explaining similarity models, i.e., explaining why a model predicts two images to be similar in addition to producing a scalar score. While much recent work in visual model interpretability has focused on…

Computer Vision and Pattern Recognition · Computer Science 2020-10-15 Meng Zheng, Srikrishna Karanam, Terrence Chen, Richard J. Radke, Ziyan Wu

Object-Centric Diagnosis of Visual Reasoning

When answering questions about an image, it not only needs knowing what -- understanding the fine-grained contents (e.g., objects, relationships) in the image, but also telling why -- reasoning over grounding visual cues to derive the…

Computer Vision and Pattern Recognition · Computer Science 2020-12-22 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for…

Computer Vision and Pattern Recognition · Computer Science 2019-01-24 David Mascharka, Philip Tran, Ryan Soklaski, Arjun Majumdar

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

The ability to reason about temporal and causal events from videos lies at the core of human intelligence. Most video reasoning benchmarks, however, focus on pattern recognition from complex visual and language input, instead of on causal…

Computer Vision and Pattern Recognition · Computer Science 2020-03-10 Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum

Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension

Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring…

Computer Vision and Pattern Recognition · Computer Science 2020-03-03 Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu

Visual Referring Expression Recognition: What Do Systems Actually Learn?

We present an empirical analysis of the state-of-the-art systems for referring expression recognition -- the task of identifying the object in an image referred to by a natural language expression -- with the goal of gaining insight into…

Computation and Language · Computer Science 2018-05-31 Volkan Cirik, Louis-Philippe Morency, Taylor Berg-Kirkpatrick