Related papers: Transformation Driven Visual Reasoning

Visual Reasoning: from State to Transformation

Most existing visual reasoning tasks, such as CLEVR in VQA, ignore an important factor, i.e., transformation. They are solely defined to test how well machines understand concepts and relations within static settings, like one image. Such…

Computer Vision and Pattern Recognition · Computer Science 2023-05-04 Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng

VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Yuheng Ji, Yipu Wang, Yuyang Liu, Xiaoshuai Hao, Yue Liu, Yuting Zhao, Huaihai Lyu, Xiaolong Zheng

Visual Transformation Telling

Humans can naturally reason from superficial state differences (e.g. ground wetness) to transformation descriptions (e.g. raining) according to their life experience. In this paper, we propose a new visual reasoning task to test this…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 Wanqing Cui, Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng

Learning Visual Reasoning Without Strong Priors

Achieving artificial visual reasoning - the ability to answer image-related questions which require a multi-step, high-level process - is an important step towards artificial general intelligence. This multi-modal task requires learning a…

Computer Vision and Pattern Recognition · Computer Science 2017-12-20 Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, Aaron Courville

Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network

Explanation and high-order reasoning capabilities are crucial for real-world visual question answering with diverse levels of inference complexity (e.g., what is the dog that is near the girl playing with?) and important for users to…

Computer Vision and Pattern Recognition · Computer Science 2019-09-24 Qingxing Cao, Bailin Li, Xiaodan Liang, Liang Lin

Transfer Learning in Visual and Relational Reasoning

Transfer learning has become the de facto standard in computer vision and natural language processing, especially where labeled data is scarce. Accuracy can be significantly improved by using pre-trained models and subsequent fine-tuning.…

Computer Vision and Pattern Recognition · Computer Science 2020-02-18 T. S. Jayram, Vincent Marois, Tomasz Kornuta, Vincent Albouy, Emre Sevgen, Ahmet S. Ozcan

Latent Visual Reasoning

Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu

Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for…

Computer Vision and Pattern Recognition · Computer Science 2019-01-24 David Mascharka, Philip Tran, Ryan Soklaski, Arjun Majumdar

Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical perception-reasoning decoupling.…

Artificial Intelligence · Computer Science 2026-01-13 Shujian Gao, Yuan Wang, Jiangtao Yan, Zuxuan Wu, Yu-Gang Jiang

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks…

Computer Vision and Pattern Recognition · Computer Science 2019-09-20 Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help,…

Computer Vision and Pattern Recognition · Computer Science 2016-12-22 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Understanding the computational demands underlying visual reasoning

Visual understanding requires comprehending complex visual relations between objects within a scene. Here, we seek to characterize the computational demands for abstract visual reasoning. We do this by systematically assessing the ability…

Computer Vision and Pattern Recognition · Computer Science 2022-03-03 Mohit Vaishnav, Remi Cadene, Andrea Alamia, Drew Linsley, Rufin VanRullen, Thomas Serre

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque;…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific…

Computer Vision and Pattern Recognition · Computer Science 2025-06-10 Tianyi Bai, Zengjie Hu, Fupeng Sun, Jiantao Qiu, Yizhen Jiang, Guangxin He, Bohan Zeng, Conghui He, Binhang Yuan, Wentao Zhang

Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning

Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization…

Computer Vision and Pattern Recognition · Computer Science 2023-06-02 Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, Alan Yuille

Visual Entailment: A Novel Task for Fine-Grained Image Understanding

Existing visual reasoning datasets, such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires…

Computer Vision and Pattern Recognition · Computer Science 2019-01-23 Ning Xie, Farley Lai, Derek Doran, Asim Kadav

See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning

Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving the knowledge-based visual reasoning tasks remains challenging, which requires a model to comprehensively understand…

Computer Vision and Pattern Recognition · Computer Science 2023-01-13 Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan

Visual Concept Reasoning Networks

A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to…

Computer Vision and Pattern Recognition · Computer Science 2020-08-28 Taesup Kim, Sungwoong Kim, Yoshua Bengio

RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning

Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying…

Computer Vision and Pattern Recognition · Computer Science 2022-06-14 Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke Zhu, Song-Chun Zhu, Anima Anandkumar

LanteRn: Latent Visual Structured Reasoning

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins