Related papers: RenderMem: Rendering as Spatial Memory Retrieval

REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack…

Machine Learning · Computer Science 2025-12-02 Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk

3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning

Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over extended periods. Existing representations, such as object-centric 3D scene…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, Chuang Gan

BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning

Embodied task planning requires agents to execute long-horizon, goal-directed actions in complex 3D environments, where success depends on both immediate perception and accumulated experience across tasks. However, most existing LLM-based…

Robotics · Computer Science 2026-04-21 Xiaoyu Ma, Lianyu Hu, Wenbing Tang, Zixuan Hu, Zeqin Liao, Zhizhen Wu, Yang Liu

A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language…

Computer Vision and Pattern Recognition · Computer Science 2025-07-10 Zhenyang Liu, Sixiao Zheng, Siyu Chen, Cairong Zhao, Longfei Liang, Xiangyang Xue, Yanwei Fu

Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

Embodied agents are expected to assist humans by actively exploring unknown environments and reasoning about spatial contexts. When deployed in real life, agents often face sequential tasks where each new task follows the completion of the…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Zhongyi Cai, Yi Du, Chen Wang, Yu Kong

SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

We present SpatialMem, a memory-centric system for long-horizon, language-grounded retrieval and QA from egocentric video, where metric 3D serves as an interpretable indexing scaffold rather than an explicit mapping objective. Starting from…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen

R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

WorldMem: Long-term Consistent World Simulation with Memory

World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term…

Computer Vision and Pattern Recognition · Computer Science 2026-01-05 Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling

Spatial world models, representations that support flexible reasoning about spatial relations, are central to developing computational models that could operate in the physical world, but their precise mechanistic underpinnings are nuanced…

Neurons and Cognition · Quantitative Biology 2025-10-27 Dezhi Luo, Qingying Gao, Hokin Deng

LanteRn: Latent Visual Structured Reasoning

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Fernando Ropero, Erkin Turkoz, Daniel Matos, Junqing Du, Antonio Ruiz, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang

Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs)…

Robotics · Computer Science 2025-09-03 Jiading Fang

Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Jinzhou Tang, Jusheng Zhang, Sidi Liu, Waikit Xiu, Qinhan Lv, Xiying Li

GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin

Embodied Language Grounding with 3D Visual Feature Representations

We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the…

Computer Vision and Pattern Recognition · Computer Science 2021-06-21 Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W. Harley, Katerina Fragkiadaki

RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming

Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen, Xiaopeng Fan

Visual Agentic AI for Spatial Reasoning with a Dynamic API

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images.…

Computer Vision and Pattern Recognition · Computer Science 2025-03-31 Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari

REMem: Reasoning with Episodic Memory in Language Agent

Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current…

Artificial Intelligence · Computer Science 2026-03-03 Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, Yu Su

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This…

Artificial Intelligence · Computer Science 2025-04-18 Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu