Related papers: Diffusion Transformer Policy

dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought

Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic…

Robotics · Computer Science 2025-10-01 Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, Yi Xu

RoLD: Robot Latent Diffusion for Multi-task Policy Modeling

Modeling generalized robot control policies poses ongoing challenges for language-guided robot manipulation tasks. Existing methods often struggle to efficiently utilize cross-dataset resources or rely on resource-intensive vision-language…

Robotics · Computer Science 2024-11-05 Wenhui Tan, Bei Liu, Junbo Zhang, Ruihua Song, Jianlong Fu

Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation

Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model…

Robotics · Computer Science 2024-11-15 Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, Jian Tang

Tenma: Robust Cross-Embodiment Robot Manipulation with Diffusion Transformer

Scaling Transformer policies and diffusion models has advanced robotic manipulation, yet combining these techniques in lightweight, cross-embodiment learning settings remains challenging. We study design choices that most affect stability…

Robotics · Computer Science 2025-09-16 Travis Davies, Yiqi Huang, Yunxin Liu, Xiang Chen, Huxian Liu, Luhui Hu

DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two…

Robotics · Computer Science 2026-03-10 Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani

Inference-stage Adaptation-projection Strategy Adapts Diffusion Policy to Cross-manipulators Scenarios

Diffusion policies are powerful visuomotor models for robotic manipulation, yet they often fail to generalize to manipulators or end-effectors unseen during training and struggle to accommodate new task requirements at inference time.…

Robotics · Computer Science 2025-09-16 Xiangtong Yao, Yirui Zhou, Yuan Meng, Yanwen Liu, Liangyu Dong, Zitao Zhang, Zhenshan Bing, Kai Huang, Fuchun Sun, Alois Knoll

Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach

Current robotic pick-and-place policies typically require consistent gripper configurations across training and inference. This constraint imposes high retraining or fine-tuning costs, especially for imitation learning-based approaches,…

Robotics · Computer Science 2025-02-24 Xiangtong Yao, Yirui Zhou, Yuan Meng, Liangyu Dong, Lin Hong, Zitao Zhang, Zhenshan Bing, Kai Huang, Fuchun Sun, Alois Knoll

mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity

We present a diffusion-based model recipe for real-world control of a highly dexterous humanoid robotic hand, designed for sample-efficient learning and smooth fine-motor action inference. Our system features a newly designed 16-DoF…

Robotics · Computer Science 2025-06-16 Elvis Nava, Victoriano Montesinos, Erik Bauer, Benedek Forrai, Jonas Pai, Stefan Weirich, Stephan-Daniel Gravert, Philipp Wand, Stephan Polinski, Benjamin F. Grewe, Robert K. Katzschmann

LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models

Learning transferable latent actions from large-scale object manipulation videos can significantly enhance generalization in downstream robotics tasks, as such representations are agnostic to different robot embodiments. Existing approaches…

Robotics · Computer Science 2025-12-01 Zuolei Li, Xingyu Gao, Xiaofan Wang, Jianlong Fu

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks…

Machine Learning · Computer Science 2024-10-10 Haoran He, Chenjia Bai, Ling Pan, Weinan Zhang, Bin Zhao, Xuelong Li

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single…

Robotics · Computer Science 2025-11-04 Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li

The Ingredients for Robotic Diffusion Transformers

In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately,…

Robotics · Computer Science 2024-10-15 Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, Sergey Levine

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

This paper introduces Diffusion Policy, a new way of generating robot behavior by representing a robot's visuomotor policy as a conditional denoising diffusion process. We benchmark Diffusion Policy across 12 different tasks from 4…

Robotics · Computer Science 2024-03-15 Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song

Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways,…

Robotics · Computer Science 2025-03-24 Kun Wu, Yichen Zhu, Jinming Li, Junjie Wen, Ning Liu, Zhiyuan Xu, Jian Tang

Diffusion Stabilizer Policy for Automated Surgical Robot Manipulations

Intelligent surgical robots have the potential to revolutionize clinical practice by enabling more precise and automated surgical procedures. However, the automation of such robots for surgical tasks remains under-explored compared to recent…

Robotics · Computer Science 2026-03-10 Chonlam Ho, Jianshu Hu, Lei Song, Hesheng Wang, Qi Dou, Yutong Ban

Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale…

Robotics · Computer Science 2026-03-11 Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, Wen Zhao, Qiang Zhang, Yijie Guo, Qihao Zheng, Chunfeng Song, Xiao Li, Ping Luo, Andrew F. Luo

Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning

In this paper, we present DiffusionVLA, a novel framework that seamlessly combines the autoregression model with the diffusion model for learning visuomotor policy. Central to our approach is a next-token prediction objective, enabling the…

Robotics · Computer Science 2025-06-05 Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, Feifei Feng

Train-Small Deploy-Large: Leveraging Diffusion-Based Multi-Robot Planning

Learning based multi-robot path planning methods struggle to scale or generalize to changes, particularly variations in the number of robots during deployment. Most existing methods are trained on a fixed number of robots and may tolerate a…

Robotics · Computer Science 2026-04-09 Siddharth Singh, Soumee Guha, Qing Chang, Scott Acton

Latent Action Diffusion for Cross-Embodiment Manipulation

End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across…

Robotics · Computer Science 2026-03-23 Erik Bauer, Elvis Nava, Robert K. Katzschmann