Related papers: Spatial Forcing: Implicit Spatial Representation A…

SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and…

Robotics · Computer Science 2026-03-25 Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D…

Robotics · Computer Science 2025-12-16 Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

Vision-Language-Action (VLA) models exhibit strong generalization in robotic manipulation, yet reinforcement learning (RL) fine-tuning often degrades robustness under spatial distribution shifts. For flow-matching VLA policies, this…

Robotics · Computer Science 2026-02-03 Xu Pan, Zhenglin Wan, Xingrui Yu, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in…

Robotics · Computer Science 2026-05-12 Hao Wang, Xiaobao Wei, Jingyang He, Chengyu Bai, Chun-Kai Fan, Jiajun Cao, Jintao Chen, Ying Li, Shanyu Rong, Ming Lu, Xiaozhu Ju, Jian Tang, Shanghang Zhang

VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Vision-language-action (VLA) models show potential for general robotic tasks, but remain challenging in spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models…

Robotics · Computer Science 2025-11-25 Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, Bo Zhao

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors

Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require…

Robotics · Computer Science 2026-03-11 Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

In this paper, we claim that spatial understanding is the keypoint in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding…

Robotics · Computer Science 2025-05-20 Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, Xuelong Li

DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-10-16 Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao

ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect…

Robotics · Computer Science 2026-03-17 You Wu, Zixuan Chen, Cunxu Ou, Wenxuan Wang, Wenbo Huang, Lin Cao, Yangtao Chen, Weichao Qiu, Xingyue Quan, Jieqi Shi, Jing Huo, Yang Gao

cVLA: Towards Efficient Camera-Space VLAs

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and…

Robotics · Computer Science 2026-03-20 Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang

TaF-VLA: Tactile-Force Alignment in Vision-Language-Action Models for Force-aware Manipulation

Vision-Language-Action (VLA) models have recently emerged as powerful generalists for robotic manipulation. However, due to their predominant reliance on visual modalities, they fundamentally lack the physical intuition required for…

Robotics · Computer Science 2026-02-02 Yuzhe Huang, Pei Lin, Wanlin Li, Daohan Li, Jiajun Li, Jiaming Jiang, Chenxi Xiao, Ziyuan Jiao

Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial…

Computer Vision and Pattern Recognition · Computer Science 2026-05-15 Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system…

Robotics · Computer Science 2026-02-11 Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang

ROSA: Harnessing Robot States for Vision-Language and Action Alignment

Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such…

Robotics · Computer Science 2025-06-17 Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun

StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation

Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct…

Robotics · Computer Science 2026-03-02 Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng, Ziang Tong

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising…

Robotics · Computer Science 2026-01-14 Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study

Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do…

Machine Learning · Computer Science 2026-05-06 Yubai Wei, Chen Wu, Hashem Haghbayan

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie