Related papers: LanteRn: Latent Visual Structured Reasoning

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Byungwoo Jeon, Yoonwoo Jeong, Hyunseok Lee, Minsu Cho, Jinwoo Shin

Latent Visual Reasoning

Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Zeyuan Yang, Xueyang Yu, Delin Chen, Maohao Shen, Chuang Gan

Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual…

Computation and Language · Computer Science 2026-01-22 Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, Zhongyu Wei

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a…

Machine Learning · Computer Science 2026-05-05 Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie

Latent Implicit Visual Reasoning

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are…

Computer Vision and Pattern Recognition · Computer Science 2025-12-25 Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation

Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Guilin Li, Bo Wang, Linghe Kong, Lichao Sun, Weiran Huang

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Chengzhi Liu, Yuzhe Yang, Yue Fan, Qingyue Wei, Sheng Liu, Xin Eric Wang

REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack…

Machine Learning · Computer Science 2025-12-02 Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk

Semantic-Enriched Latent Visual Reasoning

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan

Monet: Reasoning in Latent Visual Space Beyond Images and Language

"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque;…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation

Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin

Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective

Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency…

Artificial Intelligence · Computer Science 2025-12-03 Qiyao Xue, Weichen Liu, Shiqi Wang, Haoming Wang, Yuyang Wu, Wei Gao