Related papers: Programmatically Grounded, Compositionally General…

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Vision-language-action (VLA) models finetuned from vision-language models (VLMs) hold the promise of leveraging rich pretrained representations to build generalist robots across diverse tasks and environments. However, direct fine-tuning on…

Robotics · Computer Science 2025-09-18 Shresth Grover, Akshay Gopalkrishnan, Bo Ai, Henrik I. Christensen, Hao Su, Xuanlin Li

Learning Neuro-symbolic Programs for Language Guided Robot Manipulation

Given a natural language instruction and an input scene, our goal is to train a model to output a manipulation program that can be executed by the robot. Prior approaches for this task possess one of the following limitations: (i) rely on…

Robotics · Computer Science 2024-04-03 Namasivayam Kalithasan, Himanshu Singh, Vishal Bindal, Arnav Tuli, Vishwajeet Agrawal, Rahul Jain, Parag Singla, Rohan Paul

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large…

Robotics · Computer Science 2025-06-16 Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid

CLIPort: What and Where Pathways for Robotic Manipulation

How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require…

Robotics · Computer Science 2021-09-27 Mohit Shridhar, Lucas Manuelli, Dieter Fox

SONAR: Semantic-Object Navigation with Aggregated Reasoning through a Cross-Modal Inference Paradigm

Understanding human instructions and accomplishing Vision-Language Navigation tasks in unknown environments is essential for robots. However, existing modular approaches heavily rely on the quality of training data and often exhibit poor…

Robotics · Computer Science 2025-09-30 Yao Wang, Zhirui Sun, Wenzheng Chi, Baozhi Jia, Wenjun Xu, Jiankun Wang

Rethinking Intermediate Representation for VLM-based Robot Manipulation

Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between…

Robotics · Computer Science 2025-11-25 Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains…

Robotics · Computer Science 2026-04-07 William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine

Pre-Trained Language Models for Interactive Decision-Making

Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and…

Machine Learning · Computer Science 2022-11-01 Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, Yuke Zhu

LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies

Recent advances in legged locomotion learning are still dominated by the utilization of geometric representations of the environment, limiting the robot's capability to respond to higher-level semantics such as human instructions. To…

Robotics · Computer Science 2026-02-12 I Made Aswin Nahrendra, Seunghyun Lee, Dongkyu Lee, Hyun Myung

Improving Generalization of Language-Conditioned Robot Manipulation

The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of…

Robotics · Computer Science 2025-08-05 Chenglin Cui, Chaoran Zhu, Changjae Oh, Andrea Cavallaro

Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging

Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall…

Robotics · Computer Science 2026-03-03 Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, Sergey Levine

Semantically Controllable Augmentations for Generalizable Robot Learning

Generalization to unseen real-world scenarios for robot manipulation requires exposure to diverse datasets during training. However, collecting large real-world datasets is intractable due to high operational costs. For robot learning to…

Robotics · Computer Science 2024-09-04 Zoey Chen, Zhao Mandi, Homanga Bharadhwaj, Mohit Sharma, Shuran Song, Abhishek Gupta, Vikash Kumar

Towards Natural Language-Driven Assembly Using Foundation Models

Large Language Models (LLMs) and strong vision models have enabled rapid research and development in the field of Vision-Language-Action models that enable robotic control. The main objective of these methods is to develop a generalist…

Robotics · Computer Science 2024-06-25 Omkar Joglekar, Tal Lancewicki, Shir Kozlovsky, Vladimir Tchuiev, Zohar Feldman, Dotan Di Castro

Language Models are General-Purpose Interfaces

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though there is a big convergence in terms of architecture, most pretrained models are typically still developed for…

Computation and Language · Computer Science 2022-06-14 Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei

Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by…

Robotics · Computer Science 2025-05-28 Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to…

Robotics · Computer Science 2023-11-03 Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei

Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection

General-purpose robotic manipulation, including reach and grasp, is essential for deployment into households and workspaces involving diverse and evolving tasks. Recent advances propose using large pre-trained models, such as Large Language…

Robotics · Computer Science 2025-07-16 Huiyi Wang, Fahim Shahriar, Alireza Azimi, Gautham Vasan, Rupam Mahmood, Colin Bellinger

Grounding Language Plans in Demonstrations Through Counterfactual Perturbations

Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces,…

Robotics · Computer Science 2024-12-10 Yanwei Wang, Tsun-Hsuan Wang, Jiayuan Mao, Michael Hagenow, Julie Shah

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored -- particularly for complex instruction…

Robotics · Computer Science 2025-11-04 Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Recent advancements in robotic manipulation have highlighted the potential of intermediate representations for improving policy generalization. In this work, we explore grounding masks as an effective intermediate representation, balancing…

Robotics · Computer Science 2025-05-01 Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, Zhou Zhao