Related papers: SpatialVLA: Exploring Spatial Representations for …

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and…

Robotics · Computer Science 2026-03-25 Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D…

Robotics · Computer Science 2025-12-16 Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Vision-language-action (VLA) models show potential for general robotic tasks, but remain challenging in spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising…

Robotics · Computer Science 2026-01-14 Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect…

Robotics · Computer Science 2026-03-17 You Wu, Zixuan Chen, Cunxu Ou, Wenxuan Wang, Wenbo Huang, Lin Cao, Yangtao Chen, Weichao Qiu, Xingyue Quan, Jieqi Shi, Jing Huo, Yang Gao

GraphCoT-VLA: A 3D Spatial-Aware Reasoning Vision-Language-Action Model for Robotic Manipulation with Ambiguous Instructions

Vision-language-action models have emerged as a crucial paradigm in robotic manipulation. However, existing VLA models exhibit notable limitations in handling ambiguous language instructions and unknown environmental states. Furthermore,…

Robotics · Computer Science 2025-08-26 Helong Huang, Min Cen, Kai Tan, Xingyue Quan, Guowei Huang, Hong Zhang

GeoVLA: Empowering 3D Representations in Vision-Language-Action Models

Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich…

Robotics · Computer Science 2025-08-14 Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action…

Robotics · Computer Science 2025-10-16 Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct…

Robotics · Computer Science 2026-03-02 Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng, Ziang Tong

Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models…

Robotics · Computer Science 2025-11-25 Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, Bo Zhao

Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model

Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data,…

Robotics · Computer Science 2025-10-20 Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li

ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system…

Robotics · Computer Science 2026-02-11 Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang

Vision Language Action Models in Robotic Manipulation: A Systematic Review

Vision Language Action (VLA) models represent a transformative shift in robotics, with the aim of unifying visual perception, natural language understanding, and embodied control within a single learning framework. This review presents a…

Robotics · Computer Science 2026-01-21 Muhayy Ud Din, Waseem Akram, Lyes Saad Saoud, Jan Rosell, Irfan Hussain

3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to…

Robotics · Computer Science 2026-03-31 Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami

MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and…

Robotics · Computer Science 2026-03-20 Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang

cVLA: Towards Efficient Camera-Space VLAs

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Goal-VLA: Image-Generative VLMs as Object-Centric World Models Empowering Zero-shot Robot Manipulation

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world…

Robotics · Computer Science 2026-03-31 Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, Lin Shao

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy

Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera…

Robotics · Computer Science 2025-08-19 Tianyi Zhang, Haonan Duan, Haoran Hao, Yu Qiao, Jifeng Dai, Zhi Hou

GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning

Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is that the models exhibit…

Robotics · Computer Science 2026-02-05 Guoqing Ma, Siheng Wang, Zeyu Zhang, Shan Yu, Hao Tang

Mechanistic interpretability for steering vision-language-action models

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of…

Robotics · Computer Science 2025-09-03 Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin