
Learning Manipulation by Predicting Interaction

Robotics 2024-06-04 v1 Computer Vision and Pattern Recognition

Abstract

Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methods tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior work disregards the interactive dynamics that capture behavior patterns and physical interaction during the manipulation process, resulting in an inadequate understanding of the relationship between objects and the environment. To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object. These two learning objectives yield a stronger understanding of "how-to-interact" and "where-to-interact", respectively. We conduct a comprehensive evaluation on several challenging robotic tasks. The experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms as well as in simulation environments. Code and checkpoints are publicly shared at https://github.com/OpenDriveLab/MPI.
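
To make the two pre-training objectives concrete, the following is a minimal sketch of how a frame-prediction term ("how-to-interact") and an interaction-object detection term ("where-to-interact") might be combined into one training signal. It is an illustration under stated assumptions, not the released MPI implementation: the `Sample` container, the use of mean squared error and L1 box error, and the weights `w_frame` and `w_box` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical stand-in for an image: a small grayscale grid of floats.
Frame = List[List[float]]


@dataclass
class Sample:
    """One pre-training example (field names are assumptions for this sketch)."""
    initial: Frame          # keyframe showing the initial state
    final: Frame            # keyframe showing the final state
    transition: Frame       # ground-truth frame between the two keyframes
    instruction: str        # language instruction describing the task
    box: Sequence[float]    # ground-truth interaction-object box (x1, y1, x2, y2)


def frame_mse(pred: Frame, target: Frame) -> float:
    """Mean squared pixel error between a predicted and a target frame."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


def box_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def combined_loss(pred_frame: Frame, pred_box: Sequence[float], sample: Sample,
                  w_frame: float = 1.0, w_box: float = 1.0) -> float:
    """Weighted sum of the frame-prediction and detection terms (assumed form)."""
    return (w_frame * frame_mse(pred_frame, sample.transition)
            + w_box * box_l1(pred_box, sample.box))


def midpoint_frame(a: Frame, b: Frame) -> Frame:
    """Naive baseline: average the two keyframes pixel by pixel."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


sample = Sample(
    initial=[[0.0, 0.0], [0.0, 0.0]],
    final=[[1.0, 1.0], [1.0, 1.0]],
    transition=[[0.5, 0.5], [0.5, 0.5]],
    instruction="pick up the cup",
    box=(0.0, 0.0, 1.0, 1.0),
)

# A perfect prediction gives zero loss; a worse one is penalized by both terms.
perfect = combined_loss(midpoint_frame(sample.initial, sample.final),
                        (0.0, 0.0, 1.0, 1.0), sample)
worse = combined_loss(sample.initial, (0.5, 0.0, 1.0, 1.0), sample)
```

In a real pipeline the predictions would come from a learned model conditioned on both keyframes and the instruction; the midpoint baseline here only serves to exercise the loss.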

Cite

@article{arxiv.2406.00439,
  title  = {Learning Manipulation by Predicting Interaction},
  author = {Jia Zeng and Qingwen Bu and Bangjun Wang and Wenke Xia and Li Chen and Hao Dong and Haoming Song and Dong Wang and Di Hu and Ping Luo and Heming Cui and Bin Zhao and Xuelong Li and Yu Qiao and Hongyang Li},
  journal= {arXiv preprint arXiv:2406.00439},
  year   = {2024}
}

Comments

Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI
