Related papers: Visually Interpretable Subtask Reasoning for Visua…

200 papers

Open-source multimodal large language models (MLLMs) excel in various tasks involving textual and visual inputs but still struggle with complex multimodal mathematical reasoning, lagging behind proprietary models like GPT-4V(ision) and…

Computation and Language · Computer Science 2024-04-29 Mengzhao Jia , Zhihan Zhang , Wenhao Yu , Fangkai Jiao , Meng Jiang

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu , Jian Guan , Kaituo Feng , Qiang Liu , Shu Wu , Liang Wang , Wei Wu , Tieniu Tan

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations…

Artificial Intelligence · Computer Science 2026-05-20 Weicong Ni , Tianbao Jiang , Linlin Wang

Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning…

Computation and Language · Computer Science 2025-10-01 Julius Mayer , Mohamad Ballout , Serwan Jassim , Farbod Nosrat Nezami , Elia Bruni

Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Kaitao Chen , Shaohao Rui , Yankai Jiang , Jiamin Wu , Qihao Zheng , Chunfeng Song , Xiaosong Wang , Mu Zhou , Mianxin Liu

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque;…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Haobo Yuan , Yueyi Sun , Yanwei Li , Tao Zhang , Xueqing Deng , Henghui Ding , Lu Qi , Anran Wang , Xiangtai Li , Ming-Hsuan Yang

As robots are expected to perform increasingly diverse tasks, they must understand not only low-level actions but also the higher-level structure that determines how a task should unfold. Existing vision-language-action (VLA) models…

Robotics · Computer Science 2026-02-10 Jingyi Hou , Leyu Zhou , Chenchen Jing , Jinghan Yang , Xinbo Yu , Wei He

The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on the given image and textual question. Representative works first recognize objects in images and then associate them with key…

Computer Vision and Pattern Recognition · Computer Science 2023-12-27 Jian Zhu , Hanli Wang , Miaojing Shi

Time-series table reasoning interprets temporal patterns and relationships in data to answer user queries. Despite recent advancements leveraging large language models (LLMs), existing methods often struggle with pattern recognition,…

Human-Computer Interaction · Computer Science 2024-12-24 Jianing Hao , Zhuowen Liang , Chunting Li , Yuyu Luo , Jie Li , Wei Zeng

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this…

Machine Learning · Computer Science 2025-09-11 Mohamed Salim Aissi , Clemence Grislain , Mohamed Chetouani , Olivier Sigaud , Laure Soulier , Nicolas Thome

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Tingshu Mou , Jiabo He , Renying Wang , Ce Liu , Hao Yang , Tiehua Zhang , Jingjing Chen , Xingjun Ma

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Byungwoo Jeon , Yoonwoo Jeong , Hyunseok Lee , Minsu Cho , Jinwoo Shin

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Chao Wang , Chunbai Zhang , Yongxiao Tian , Yang Zhou , Yan Peng

Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Evžen Wybitul , Evan Ryan Gunter , Mikhail Seleznyov , David Lindner

Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning…

Computation and Language · Computer Science 2025-11-14 Yiran Zhang , Mingyang Lin , Mark Dras , Usman Naseem

Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific…

Computer Vision and Pattern Recognition · Computer Science 2025-02-26 Yucheng Shi , Quanzheng Li , Jin Sun , Xiang Li , Ninghao Liu

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros , Nuno Gonçalves , Matthias Lindemann , André Martins

Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Yanbei Jiang , Yihao Ding , Chao Lei , Jiayang Ao , Jey Han Lau , Krista A. Ehinger

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Jiacong Wang , Zijian Kang , Haochen Wang , Haiyong Jiang , Jiawen Li , Bohong Wu , Ya Wang , Jiao Ran , Xiao Liang , Chao Feng , Jun Xiao

Data visualization tasks often require multi-step reasoning, and the interpretive strategies experts use, such as decomposing complex goals into smaller subtasks and selectively attending to key chart regions, are rarely made explicit.…

Human-Computer Interaction · Computer Science 2025-06-30 Oliver Huang , Carolina Nobre