Related papers: SpatialVLA: Exploring Spatial Representations for …

200 papers

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and…

Robotics · Computer Science 2026-03-25 Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D…

Robotics · Computer Science 2025-12-16 Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

Vision-language-action (VLA) models show potential for general robotic tasks, but remain challenging in spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising…

Robotics · Computer Science 2026-01-14 Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect…

Vision-language-action models have emerged as a crucial paradigm in robotic manipulation. However, existing VLA models exhibit notable limitations in handling ambiguous language instructions and unknown environmental states. Furthermore,…

Robotics · Computer Science 2025-08-26 Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, Hong Zhang

Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich…

Robotics · Computer Science 2025-08-14 Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action…

Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct…

Robotics · Computer Science 2026-03-02 Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng, Ziang Tong

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models…

Robotics · Computer Science 2025-11-25 Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, Bo Zhao

Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data,…

Robotics · Computer Science 2025-10-20 Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system…

Vision Language Action (VLA) models represent a transformative shift in robotics, with the aim of unifying visual perception, natural language understanding, and embodied control within a single learning framework. This review presents a…

Robotics · Computer Science 2026-01-21 Muhayy Ud Din, Waseem Akram, Lyes Saad Saoud, Jan Rosell, Irfan Hussain

Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to…

Robotics · Computer Science 2026-03-31 Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and…

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world…

Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera…

Robotics · Computer Science 2025-08-19 Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, Zhi Hou

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit…

Robotics · Computer Science 2026-02-05 Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of…

Robotics · Computer Science 2025-09-03 Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin