Related papers: Capturing Visual Environment Structure Correlates …

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

The field of visual representation learning has seen explosive growth in the past years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific)…

Robotics · Computer Science · 2023-08-16 · Jianren Wang, Sudeep Dasari, Mohan Kumar Srirama, Shubham Tulsiani, Abhinav Gupta

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information,…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-06 · Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen

Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer

Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or Sim2Real distribution shift -- introduced by employing simulation-trained…

Robotics · Computer Science · 2025-09-09 · Yash Yardi, Samuel Biruduganti, Lars Ankile

Efficient Latent Representations using Multiple Tasks for Autonomous Driving

Driving in the dynamic, multi-agent, and complex urban environment is a difficult task requiring a complex decision policy. The learning of such a policy requires a state representation that can encode the entire environment. Mid-level…

Robotics · Computer Science · 2020-03-03 · Eshagh Kargar, Ville Kyrki

Object-Centric Representations Improve Policy Generalization in Robot Manipulation

Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant…

Robotics · Computer Science · 2025-05-20 · Alexandre Chapin, Bruno Machado, Emmanuel Dellandrea, Liming Chen

What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?

Inspired by the success of transfer learning in computer vision, roboticists have investigated visual pre-training as a means to improve the learning efficiency and generalization ability of policies learned from pixels. To that end, past…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-21 · Kaylee Burns, Zach Witzel, Jubayer Ibn Hamid, Tianhe Yu, Chelsea Finn, Karol Hausman

Environment Predictive Coding for Embodied Agents

We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of…

Computer Vision and Pattern Recognition · Computer Science · 2021-02-05 · Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

Visual Representations for Semantic Target Driven Navigation

What is a good visual representation for autonomous agents? We address this question in the context of semantic visual navigation, which is the problem of a robot finding its way through a complex environment to a target object, e.g. go to…

Computer Vision and Pattern Recognition · Computer Science · 2019-07-04 · Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, Ayzaan Wahid, James Davidson

The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-10 · Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, Abhinav Gupta

The Surprising Effectiveness of Representation Learning for Visual Imitation

While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task specific priors, or large, hard-to-train…

Robotics · Computer Science · 2021-12-07 · Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, Lerrel Pinto

Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics

Scaling end-to-end reinforcement learning to control real robots from vision presents a series of challenges, in particular in terms of sample efficiency. Against end-to-end learning, state representation learning can help learn a compact,…

Machine Learning · Computer Science · 2019-06-25 · Antonin Raffin, Ashley Hill, René Traoré, Timothée Lesort, Natalia Díaz-Rodríguez, David Filliat

Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies

How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set…

Computer Vision and Pattern Recognition · Computer Science · 2019-04-23 · Alexander Sax, Bradley Emi, Amir R. Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high…

Robotics · Computer Science · 2020-11-16 · Bryan Chen, Alexander Sax, Gene Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation

The data-driven approach to robot control has been gathering pace rapidly, yet generalization to unseen task domains remains a critical challenge. We argue that the key to generalization is representations that are (i) rich enough to…

Robotics · Computer Science · 2023-12-05 · Bo Ai, Zhanxin Wu, David Hsu

VGGT-DP: Generalizable Robot Control via Vision Foundation Models

Visual imitation learning frameworks allow robots to learn manipulation skills from expert demonstrations. While existing approaches mainly focus on policy design, they often neglect the structure and capacity of visual encoders, limiting…

Robotics · Computer Science · 2025-09-24 · Shijia Ge, Yinxin Zhang, Shuzhao Xie, Weixiang Zhang, Mingcai Zhou, Zhi Wang

Real-World Robot Learning with Masked Visual Pre-training

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and…

Robotics · Computer Science · 2022-10-07 · Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by…

Robotics · Computer Science · 2025-05-28 · Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas

Natural Language Can Help Bridge the Sim2Real Gap

The main challenge in learning image-conditioned robotic policies is acquiring a visual representation conducive to low-level control. Due to the high dimensionality of the image space, learning a good visual representation requires a…

Robotics · Computer Science · 2024-07-03 · Albert Yu, Adeline Foote, Raymond Mooney, Roberto Martín-Martín

An Exploration of Embodied Visual Exploration

Embodied computer vision considers perception for robots in novel, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite…

Computer Vision and Pattern Recognition · Computer Science · 2020-08-24 · Santhosh K. Ramakrishnan, Dinesh Jayaraman, Kristen Grauman

Interpreting the structure of multi-object representations in vision encoders

In this work, we interpret the representations of multi-object scenes in vision encoders through the lens of structured representations. Structured representations allow modeling of individual objects distinctly and their flexible use based…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Tarun Khajuria, Braian Olmiro Dias, Marharyta Domnich, Jaan Aru