Related papers: Spatial Forcing: Implicit Spatial Representation A…

200 papers

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and…

Robotics · Computer Science 2026-03-25 Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D…

Robotics · Computer Science 2025-12-16 Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this…

Robotics · Computer Science 2026-02-03 Xu Pan, Zhenglin Wan, Xingrui Yu, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in…

Vision-language-action (VLA) models show potential for general robotic tasks, but still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models…

Robotics · Computer Science 2025-11-25 Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, Bo Zhao

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require…

In this paper, we claim that spatial understanding is the key point in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding…

Robotics · Computer Science 2025-05-20 Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, Xuelong Li

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-10-16 Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect…

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and…

Vision-Language-Action (VLA) models have recently emerged as powerful generalists for robotic manipulation. However, due to their predominant reliance on visual modalities, they fundamentally lack the physical intuition required for…

Robotics · Computer Science 2026-02-02 Yuzhe Huang, Pei Lin, Wanlin Li, Daohan Li, Jiajun Li, Jiaming Jiang, Chenxi Xiao, Ziyuan Jiao

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-15 Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system…

Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, owing to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such…

Robotics · Computer Science 2025-06-17 Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun

Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct…

Robotics · Computer Science 2026-03-02 Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng, Ziang Tong

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising…

Robotics · Computer Science 2026-01-14 Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do…

Machine Learning · Computer Science 2026-05-06 Yubai Wei, Chen Wu, Hashem Haghbayan

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie