Related papers: Self-Consistent Model-based Adaptation for Visual …
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning, where agents often overfit to the specific visual observations of the training environment. In unseen environments, distracting pixels…
Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically…
In this paper we explore using self-supervised correspondence for improving the generalization performance and sample efficiency of visuomotor policy learning. Prior work has primarily used approaches such as autoencoding, pose-based…
Recently, Transformer-based architecture has been introduced into single image deraining task due to its advantage in modeling non-local information. However, existing approaches tend to integrate global features based on a dense…
Training reinforcement learning (RL) policies for legged locomotion often requires extensive environment interactions, which are costly and time-consuming. We propose Symmetry-Guided Memory Augmentation (SGMA), a framework that improves…
We use model-free reinforcement learning, extensive simulation, and transfer learning to develop a continuous control algorithm that has good zero-shot performance in a real physical environment. We train a simulated agent to act optimally…
Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal…
Learning policies which are robust to changes in the environment are critical for real world deployment of Reinforcement Learning agents. They are also necessary for achieving good generalization across environment shifts. We focus on…
Visual Question and Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving VQA problems requires techniques from both computer vision for understanding the visual contents of a presented…
Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the…
We propose a domain adaptation method, MoDA, which adapts a pretrained embodied agent to a new, noisy environment without ground-truth supervision. Map-based memory provides important contextual information for visual navigation, and…
How to accurately learn task-relevant state representations from high-dimensional observations with visual distractions is a realistic and challenging problem in visual reinforcement learning. Recently, unsupervised representation learning…
Reinforcement learning often requires extensive training data. Simulation-to-real transfer offers a promising approach to address this challenge in robotics. While differentiable simulators offer improved sample efficiency through exact…
Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions according to the visual cue. Unfortunately, language bias is a common problem in VQA, which refers to the model generating answers only by…
Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky…
Learning from previously collected data via behavioral cloning or offline reinforcement learning (RL) is a powerful recipe for scaling generalist agents by avoiding the need for expensive online learning. Despite strong generalization in…
In this paper, we introduce a realistic and challenging domain adaptation problem called Universal Semi-supervised Model Adaptation (USMA), which i) requires only a pre-trained source model, ii) allows the source and target domain to have…
Recently, state space models (SSM), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods…
World models have been developed to support sample-efficient deep reinforcement learning agents. However, it remains challenging for world models to accurately replicate environments that are high-dimensional, non-stationary, and composed…
Vision Transformer (ViT) has recently gained significant attention in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the attention mechanism.…