Related papers: SIMPACT: Simulation-Enabled Action Planning using …

IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models

Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter,…

Robotics · Computer Science · 2026-03-10 · Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita

DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding…

Robotics · Computer Science · 2026-03-18 · Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang

APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight

Large Language Models (LLMs) demonstrate strong reasoning and task planning capabilities but remain fundamentally limited in physical interaction modeling. Existing approaches integrate perception via Vision-Language Models (VLMs) or…

Robotics · Computer Science · 2025-10-17 · Wanjing Huang, Weixiang Yan, Zhen Zhang, Ambuj Singh

Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models…

Robotics · Computer Science · 2025-02-25 · Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs…

Robotics · Computer Science · 2024-02-13 · Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-19 · Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang

RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics

Visual Language Models (VLMs) have emerged as pivotal tools for robotic systems, enabling cross-task generalization, dynamic environmental interaction, and long-horizon planning through multimodal perception and semantic reasoning. However,…

Robotics · Computer Science · 2025-04-04 · Zhiyuan Zhang, Yuxin He, Yong Sun, Junyu Shi, Lijiang Liu, Qiang Nie

Vision-Language-Policy Model for Dynamic Robot Task Planning

Bridging the gap between natural language commands and autonomous execution in unstructured environments remains an open challenge for robotics. This requires robots to perceive and reason over the current task scene through multiple…

Robotics · Computer Science · 2025-12-23 · Jin Wang, Kim Tien Ly, Jacques Cloete, Nikos Tsagarakis, Ioannis Havoutis

Physically Grounded Vision-Language Models for Robotic Manipulation

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world,…

Robotics · Computer Science · 2024-03-05 · Jensen Gao, Bidipta Sarkar, Fei Xia, Ted Xiao, Jiajun Wu, Brian Ichter, Anirudha Majumdar, Dorsa Sadigh

Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling

Vision language models (VLMs) exhibit vast knowledge of the physical world, including intuition of physical and spatial properties, affordances, and motion. With fine-tuning, VLMs can also natively produce robot trajectories. We demonstrate…

Robotics · Computer Science · 2025-05-16 · William Xie, Max Conway, Yutong Zhang, Nikolaus Correll

ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models

The advancement of embodied intelligence is accelerating the integration of robots into daily life as human assistants. This evolution requires robots to not only interpret high-level instructions and plan tasks but also perceive and adapt…

Robotics · Computer Science · 2025-08-19 · Zhichen Lou, Kechun Xu, Zhongxiang Zhou, Rong Xiong

RePLan: Robotic Replanning with Perception and Language Models

Advancements in large language models (LLMs) have demonstrated their potential in facilitating high-level reasoning, logical reasoning and robotics planning. Recently, LLMs have also been able to generate reward functions for low-level…

Robotics · Computer Science · 2024-02-21 · Marta Skreta, Zihan Zhou, Jia Lin Yuan, Kourosh Darvish, Alán Aspuru-Guzik, Animesh Garg

VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

Although Model Predictive Control (MPC) can effectively predict the future states of a system and thus is widely used in robotic manipulation tasks, it does not have the capability of environmental perception, leading to the failure in some…

Robotics · Computer Science · 2024-07-16 · Wentao Zhao, Jiaming Chen, Ziyu Meng, Donghui Mao, Ran Song, Wei Zhang

Open-World Task and Motion Planning via Vision-Language Model Generated Constraints

Foundation models like Vision-Language Models (VLMs) excel at common sense vision and language tasks such as visual question answering. However, they cannot yet directly solve complex, long-horizon robot manipulation problems requiring…

Robotics · Computer Science · 2026-03-12 · Nishanth Kumar, William Shen, Fabio Ramos, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Caelan Reed Garrett

Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization

Solving complex, long-horizon robotic manipulation tasks requires a deep understanding of physical interactions, reasoning about their long-term consequences, and precise high-level planning. Vision-Language Models (VLMs) offer a general…

Robotics · Computer Science · 2026-02-24 · Yanting Yang, Shenyuan Gao, Qingwen Bu, Li Chen, Dimitris N. Metaxas

PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often…

Robotics · Computer Science · 2025-07-01 · Atharva Gundawar, Som Sagar, Ransalu Senanayake

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed…

Computation and Language · Computer Science · 2025-11-12 · Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, Nedim Lipka

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-25 · Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung

Rethinking Intermediate Representation for VLM-based Robot Manipulation

Vision-Language Model (VLM) is an important component to enable robust robot manipulation. Yet, using it to translate human instructions into an action-resolvable intermediate representation often needs a tradeoff between…

Robotics · Computer Science · 2025-11-25 · Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu

Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing…

Artificial Intelligence · Computer Science · 2025-11-25 · Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting