Related papers: Incremental Object Grounding Using Scene Graphs

200 papers

This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-04 · Aditay Tripathi, Anand Mishra, Anirban Chakraborty

Scene graphs provide structured semantic understanding beyond images. For downstream tasks, such as image retrieval, visual question answering, visual relationship detection, and even autonomous vehicle technology, scene graphs can not only…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-21 · Mingzhe Du

Robots are finding wider adoption in human environments, increasing the need for natural human-robot interaction. However, understanding a natural language command requires the robot to infer the intended task and how to decompose it into…

Robotics · Computer Science · 2026-02-05 · Julia Kuhn, Francesco Verdoja, Tsvetomila Mihaylova, Ville Kyrki

Enabling robots to grasp objects specified through natural language is essential for effective human-robot interaction, yet it remains a significant challenge. Existing approaches often struggle with open-form language expressions and…

Robotics · Computer Science · 2025-09-11 · Houjian Yu, Zheming Zhou, Min Sun, Omid Ghasemalizadeh, Yuyin Sun, Cheng-Hao Kuo, Arnie Sen, Changhyun Choi

Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with various objects that are present in the image,…

Machine Learning · Computer Science · 2020-07-03 · Marcel Hildebrandt, Hang Li, Rajat Koner, Volker Tresp, Stephan Günnemann

Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-15 · Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li

Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often…

Computer Vision and Pattern Recognition · Computer Science · 2020-04-21 · Sibei Yang, Guanbin Li, Yizhou Yu

3D scene graphs have empowered robots with semantic understanding for navigation and planning. However, current functional scene graphs primarily focus on static element detection, lacking the actionable kinematic information required for…

A scene graph is a semantic representation that expresses the objects, attributes, and relationships between objects in a scene. Scene graphs play an important role in many cross modality tasks, as they are able to capture the interactions…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-20 · Xuming Hu, Zhijiang Guo, Yu Fu, Lijie Wen, Philip S. Yu

Data augmentation is an essential technique in improving the generalization of deep neural networks. The majority of existing image-domain augmentations either rely on geometric and structural transformations, or apply different kinds of…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-21 · Morgan Heisler, Amin Banitalebi-Dehkordi, Yong Zhang

We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object…

The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is…

Robotics · Computer Science · 2017-07-19 · Mohit Shridhar, David Hsu

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-18 · Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang

Graph based representation has been widely used in modelling spatio-temporal relationships in video understanding. Although effective, existing graph-based approaches focus on capturing the human-object relationships while ignoring…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-14 · Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando

This paper presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: infer objects and their relationships from input…

Robotics · Computer Science · 2018-06-12 · Mohit Shridhar, David Hsu

In recent years, developing AI for robotics has raised much attention. The interaction of vision and language of robots is particularly difficult. We consider that giving robots an understanding of visual semantics and language semantics…

Robotics · Computer Science · 2021-05-26 · Cheng Yu Tsai, Mu-Chun Su

How can we build robots for open-world semantic navigation tasks, like searching for target objects in novel scenes? While foundation models have the rich knowledge and generalisation needed for these tasks, a suitable scene representation…

Robotics · Computer Science · 2024-07-03 · Joel Loo, Zhanxin Wu, David Hsu

As robots begin to cohabit with humans in semi-structured environments, the need arises to understand instructions involving rich variability, for instance, learning to ground symbols in the physical world. Realistically, this task must…

Artificial Intelligence · Computer Science · 2017-06-02 · Yordan Hristov, Svetlin Penkov, Alex Lascarides, Subramanian Ramamoorthy

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-30 · Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

Graph-based representations such as Scene Graphs enable localization in structured indoor environments by matching a locally observed graph, constructed from sensor data, to a prior map. This process is particularly challenging in…
