Related papers: Visually Interpretable Subtask Reasoning for Visua…

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and…

Computation and Language · Computer Science 2024-04-29 Mengzhao Jia, Zhihan Zhang, Wenhao Yu, Fangkai Jiao, Meng Jiang

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations…

Artificial Intelligence · Computer Science 2026-05-20 Weicong Ni, Tianbao Jiang, Linlin Wang

iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs

Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning…

Computation and Language · Computer Science 2025-10-01 Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, Elia Bruni

Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque;…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

Differentiate-and-Inject: Enhancing VLAs via Functional Differentiation Induced by In-Parameter Structural Reasoning

As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models…

Robotics · Computer Science 2026-02-10 Jingyi Hou, Leyu Zhou, Chenchen Jing, Jinghan Yang, Xinbo Yu, Wei He

Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on the given image and textual question. Representative works first recognize objects in images and then associate them with key…

Computer Vision and Pattern Recognition · Computer Science 2023-12-27 Jian Zhu, Hanli Wang, Miaojing Shi

VisTR: Visualizations as Representations for Time-series Table Reasoning

Time-series table reasoning interprets temporal patterns and relationships in data to answer user queries. Despite recent advancements leveraging large language models (LLMs), existing methods often struggle with pattern recognition,…

Human-Computer Interaction · Computer Science 2024-12-24 Jianing Hao, Zhuowen Liang, Chunting Li, Yuyu Luo, Jie Li, Wei Zeng

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models to visual instruction-based planning remains a wide-open problem. In this…

Machine Learning · Computer Science 2025-09-11 Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Chao Wang, Chunbai Zhang, Yongxiao Tian, Yang Zhou, Yan Peng

ViSTa Dataset: Do vision-language models understand sequential tasks?

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Evžen Wybitul, Evan Ryan Gunter, Mikhail Seleznyov, David Lindner

Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning…

Computation and Language · Computer Science 2025-11-14 Yiran Zhang, Mingyang Lin, Mark Dras, Usman Naseem

Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific…

Computer Vision and Pattern Recognition · Computer Science 2025-02-26 Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu

LanteRn: Latent Visual Structured Reasoning

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task

Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Yanbei Jiang, Yihao Ding, Chao Lei, Jiayang Ao, Jey Han Lau, Krista A. Ehinger

VGR: Visual Grounded Reasoning

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues

Data visualization tasks often require multi-step reasoning, and the interpretive strategies experts use, such as decomposing complex goals into smaller subtasks and selectively attending to key chart regions, are rarely made explicit.…

Human-Computer Interaction · Computer Science 2025-06-30 Oliver Huang, Carolina Nobre