Related papers: Language Grounding with 3D Objects
The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is…
For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper,…
In this work we explore how fine-grained differences between the shapes of common objects are expressed in language, grounded on images and 3D models of the objects. We first build a large scale, carefully controlled dataset of human…
Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but…
Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods…
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language…
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a…
We develop a system to disambiguate object instances within the same class based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar…
Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object's type such as "scissors" and/or visual…
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred by the text, such as "the left most chair"…
When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an…
As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability---for instance, learning to ground symbols in the physical world. Realistically, this task must…
Robots are finding wider adoption in human environments, increasing the need for natural human-robot interaction. However, understanding a natural language command requires the robot to infer the intended task and how to decompose it into…
Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric…
The contribution of this paper is to provide a semantic model (using soft constraints) of the words used by web-users to describe objects in a language game; a game in which one user describes a selected object of those composing the scene,…
One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability…
3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen…
As robots become more ubiquitous and capable, it becomes ever more important to enable untrained users to easily interact with them. Recently, this has led to study of the language grounding problem, where the goal is to extract…
The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between…
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair…