Related papers: Language-Driven Representation Learning for Roboti…

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

The field of visual representation learning has seen explosive growth in the past years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific)…

Robotics · Computer Science 2023-08-16 Jianren Wang, Sudeep Dasari, Mohan Kumar Srirama, Shubham Tulsiani, Abhinav Gupta

Masked World Models for Visual Control

Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual…

Robotics · Computer Science 2023-05-30 Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, Pieter Abbeel

The Surprising Effectiveness of Representation Learning for Visual Imitation

While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task specific priors, or large, hard-to-train…

Robotics · Computer Science 2021-12-07 Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, Lerrel Pinto

Multi-View Masked World Models for Visual Robotic Manipulation

Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this paper, we investigate how to learn good…

Robotics · Computer Science 2023-06-01 Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data

A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been…

Robotics · Computer Science 2022-08-31 Oier Mees, Lukas Hermann, Wolfram Burgard

Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by…

Robotics · Computer Science 2025-05-28 Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas

Imitation Learning of Robot Policies by Combining Language, Vision and Demonstration

In this work we propose a novel end-to-end imitation learning approach which combines natural language, vision, and motion information to produce an abstract representation of a task, which in turn is used to synthesize specific motion…

Robotics · Computer Science 2019-11-27 Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Chitta Baral, Heni Ben Amor

Human-oriented Representation Learning for Robotic Manipulation

Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from…

Robotics · Computer Science 2023-10-05 Mingxiao Huo, Mingyu Ding, Chenfeng Xu, Thomas Tian, Xinghao Zhu, Yao Mu, Lingfeng Sun, Masayoshi Tomizuka, Wei Zhan

End-to-End Multimodal Representation Learning for Video Dialog

Video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years with state-of-the-art obtaining new performance records. This progress is largely powered by the adaptation of…

Computer Vision and Pattern Recognition · Computer Science 2022-10-27 Huda Alamri, Anthony Bilic, Michael Hu, Apoorva Beedu, Irfan Essa

Real-World Robot Learning with Masked Visual Pre-training

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and…

Robotics · Computer Science 2022-10-07 Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

Multi-Modal Representation Learning with Text-Driven Soft Masks

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

LaVA-Man: Learning Visual Action Representations for Robot Manipulation

Visual-textual understanding is essential for language-guided robot manipulation. Recent works leverage pre-trained vision-language models to measure the similarity between encoded visual observations and textual instructions, and then…

Robotics · Computer Science 2025-09-30 Chaoran Zhu, Hengyi Wang, Yik Lung Pang, Changjae Oh

Robotic Applications of Pre-Trained Vision-Language Models to Various Recognition Behaviors

In recent years, a number of models that learn the relations between vision and language from large datasets have been released. These models perform a variety of tasks, such as answering questions about images, retrieving sentences that…

Robotics · Computer Science 2024-03-19 Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba

Self-supervised video pretraining yields robust and more human-aligned visual representations

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for…

Computer Vision and Pattern Recognition · Computer Science 2025-01-13 Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff

Learning Visually Guided Latent Actions for Assistive Teleoperation

It is challenging for humans -- particularly those living with physical disabilities -- to control high-dimensional, dexterous robots. Prior work explores learning embedding functions that map a human's low-dimensional inputs (e.g., via a…

Robotics · Computer Science 2021-05-04 Siddharth Karamcheti, Albert J. Zhai, Dylan P. Losey, Dorsa Sadigh

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Multimodality Representation Learning, as a technique of learning to embed information from different modalities and their correlations, has achieved remarkable success on a variety of applications, such as Visual Question Answering (VQA),…

Artificial Intelligence · Computer Science 2024-03-04 Muhammad Arslan Manzoor, Sarah Albarri, Ziting Xian, Zaiqiao Meng, Preslav Nakov, Shangsong Liang

Learning Visual-Audio Representations for Voice-Controlled Robots

Based on the recent advancements in representation learning, we propose a novel pipeline for task-oriented voice-controlled robots with raw sensor inputs. Previous methods rely on a large number of labels and task-specific reward functions.…

Robotics · Computer Science 2023-03-07 Peixin Chang, Shuijing Liu, D. Livingston McPherson, Katherine Driggs-Campbell

Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant…

Robotics · Computer Science 2026-02-17 Junlin Wang, Zhiyun Lin

Grounding Language with Visual Affordances over Unstructured Data

Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills. However, in practice, learning multi-task, language-conditioned robotic skills typically requires…

Robotics · Computer Science 2023-03-09 Oier Mees, Jessica Borja-Diaz, Wolfram Burgard

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation.…

Robotics · Computer Science 2024-10-31 Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu