Related papers: Language Grounding with 3D Objects

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is…

Robotics · Computer Science 2017-07-19 Mohit Shridhar, David Hsu

LanguageRefer: Spatial-Language Model for 3D Visual Grounding

For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper,…

Robotics · Computer Science 2021-11-08 Junha Roh, Karthik Desingh, Ali Farhadi, Dieter Fox

ShapeGlot: Learning Language for Shape Differentiation

In this work we explore how fine-grained differences between the shapes of common objects are expressed in language, grounded on images and 3D models of the objects. We first build a large scale, carefully controlled dataset of human…

Computation and Language · Computer Science 2019-05-09 Panos Achlioptas, Judy Fan, Robert X. D. Hawkins, Noah D. Goodman, Leonidas J. Guibas

Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but…

Robotics · Computer Science 2023-04-03 Qian Luo, Yunfei Li, Yi Wu

B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding

Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language…

Computer Vision and Pattern Recognition · Computer Science 2024-01-22 Chen Liang, Yu Wu, Yawei Luo, Yi Yang

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a…

Computer Vision and Pattern Recognition · Computer Science 2023-09-12 Yiming Zhang, ZeMing Gong, Angel X. Chang

Grounding Language Attributes to Objects using Bayesian Eigenobjects

We develop a system to disambiguate object instances within the same class based on simple physical descriptions. The system takes as input a natural language phrase and a depth image containing a segmented object and predicts how similar…

Robotics · Computer Science 2019-08-05 Vanya Cohen, Benjamin Burchfiel, Thao Nguyen, Nakul Gopalan, Stefanie Tellex, George Konidaris

Robot Object Retrieval with Contextual Natural Language Queries

Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object's type such as "scissors" and/or visual…

Robotics · Computer Science 2020-06-25 Thao Nguyen, Nakul Gopalan, Roma Patel, Matt Corsaro, Ellie Pavlick, Stefanie Tellex

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred by the text, such as "the left most chair"…

Computer Vision and Pattern Recognition · Computer Science 2022-11-18 Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an…

Computation and Language · Computer Science 2024-04-11 Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

Grounding Symbols in Multi-Modal Instructions

As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability, for instance, learning to ground symbols in the physical world. Realistically, this task must…

Artificial Intelligence · Computer Science 2017-06-02 Yordan Hristov, Svetlin Penkov, Alex Lascarides, Subramanian Ramamoorthy

Relational Scene Graphs for Object Grounding of Natural Language Commands

Robots are finding wider adoption in human environments, increasing the need for natural human-robot interaction. However, understanding a natural language command requires the robot to infer the intended task and how to decompose it into…

Robotics · Computer Science 2026-02-05 Julia Kuhn, Francesco Verdoja, Tsvetomila Mihaylova, Ville Kyrki

LIEREx: Language-Image Embeddings for Robotic Exploration

Semantic maps allow a robot to reason about its surroundings to fulfill tasks such as navigating known environments, finding specific objects, and exploring unmapped areas. Traditional mapping approaches provide accurate geometric…

Robotics · Computer Science 2026-02-03 Felix Igelbrink, Lennart Niecksch, Marian Renz, Martin Günther, Martin Atzmueller

Using Soft Constraints To Learn Semantic Models Of Descriptions Of Shapes

The contribution of this paper is to provide a semantic model (using soft constraints) of the words used by web-users to describe objects in a language game; a game in which one user describes a selected object of those composing the scene,…

Computation and Language · Computer Science 2010-05-31 Sergio Guadarrama, David P. Pancho

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability…

Computer Vision and Pattern Recognition · Computer Science 2020-01-07 Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, Anton van den Hengel

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

3D visual grounding is the task of localizing the object in a 3D scene which is referred by a description in natural language. With a wide range of applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool

A Joint Model of Language and Perception for Grounded Attribute Learning

As robots become more ubiquitous and capable, it becomes ever more important to enable untrained users to easily interact with them. Recently, this has led to study of the language grounding problem, where the goal is to extract…

Computation and Language · Computer Science 2012-07-03 Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, Dieter Fox

SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between…

Computer Vision and Pattern Recognition · Computer Science 2025-07-11 Jiaxin Huang, Ziwen Li, Hanlve Zhang, Runnan Chen, Xiao He, Yandong Guo, Wenping Wang, Tongliang Liu, Mingming Gong

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair…

Computer Vision and Pattern Recognition · Computer Science 2023-07-19 Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao