Related papers: Action with Visual Primitives

Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges

Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational…

Computer Vision and Pattern Recognition · Computer Science 2026-02-02 Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee

PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation

Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct…

Robotics · Computer Science 2026-05-28 Yutai Li, Shaohui Peng, Jiaming Guo, Di Huang, Zihao Zhang, Yuxuan Guo, Yunkai Gao, Siming Lan, Ling Li, Xing Hu, Yunji Chen

Avi: Action from Volumetric Inference

We propose Avi, a novel 3D Vision-Language-Action (VLA) architecture that reframes robotic action generation as a problem of 3D perception and spatial reasoning, rather than low-level policy learning. While existing VLA models primarily…

Robotics · Computer Science 2025-10-28 Harris Song, Long Le

cVLA: Towards Efficient Camera-Space VLAs

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive…

Robotics · Computer Science 2025-12-23 Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Vision-language-action (VLA) models finetuned from vision-language models (VLMs) hold the promise of leveraging rich pretrained representations to build generalist robots across diverse tasks and environments. However, direct fine-tuning on…

Robotics · Computer Science 2025-09-18 Shresth Grover, Akshay Gopalkrishnan, Bo Ai, Henrik I. Christensen, Hao Su, Xuanlin Li

OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

The development of general robotic systems capable of manipulating in unstructured environments is a significant challenge. While Vision-Language Models (VLM) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial…

Robotics · Computer Science 2025-01-08 Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, Hao Dong

SimVLA: A Simple VLA Baseline for Robotic Manipulation

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial…

Robotics · Computer Science 2026-02-23 Yuankai Luo, Woping Chen, Tong Liang, Baiqiao Wang, Zhenguo Li

APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model

Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses…

Robotics · Computer Science 2026-03-11 Yuanjie Lu, Beichen Wang, Zhengqi Wu, Yang Li, Xiaomin Lin, Chengzhi Mao, Xuesu Xiao

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained…

Robotics · Computer Science 2024-12-02 Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs…

Robotics · Computer Science 2025-09-23 Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov…

Machine Learning · Computer Science 2026-04-13 Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates…

Robotics · Computer Science 2026-01-06 Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang

See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing…

Robotics · Computer Science 2025-12-09 Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Weixin Mao, Te Cui, Minzhao Zhu, Yinan Deng, Luojie Yang, Zhanqi Zhang, Yi Yang, Hua Chen, Yufeng Yue

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Vision-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent works have begun to explore the incorporation of…

Robotics · Computer Science 2025-09-26 Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and…

Robotics · Computer Science 2025-10-09 Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, Yuke Zhu

Unified Vision-Language-Action Model

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language…

Computer Vision and Pattern Recognition · Computer Science 2025-06-25 Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA…

Computer Vision and Pattern Recognition · Computer Science 2025-10-20 Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan

Pure Vision Language Action (VLA) Models: A Comprehensive Survey

The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for…

Robotics · Computer Science 2025-11-11 Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, Qingguo Zhou

From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot…

Robotics · Computer Science 2025-06-12 Irving Fang, Juexiao Zhang, Shengbang Tong, Chen Feng

Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

Although Vision-Language Models (VLM) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen