Related papers: Learning to Ground Visual Objects for Visual Dialo…

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

Visual Relation Grounding in Videos

In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV). The task aims at spatio-temporally localizing the given relations in the form of subject-predicate-object in the videos, so as to provide supportive…

Computer Vision and Pattern Recognition · Computer Science 2020-07-22 Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, Tat-Seng Chua

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for the targets that…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, Zechao Li

Visual Grounding of Learned Physical Models

Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions. The abilities to perform physical reasoning and to adapt to new environments, while intrinsic…

Machine Learning · Computer Science 2020-06-30 Yunzhu Li, Toru Lin, Kexin Yi, Daniel M. Bear, Daniel L. K. Yamins, Jiajun Wu, Joshua B. Tenenbaum, Antonio Torralba

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly…

Computation and Language · Computer Science 2021-09-20 Feilong Chen, Fandong Meng, Xiuyi Chen, Peng Li, Jie Zhou

Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction

We study weakly-supervised video object grounding: given a video segment and a corresponding descriptive sentence, the goal is to localize objects that are mentioned from the sentence in the video. During training, no object bounding boxes…

Computer Vision and Pattern Recognition · Computer Science 2018-07-23 Luowei Zhou, Nathan Louis, Jason J. Corso

Modeling Coreference Relations in Visual Dialog

Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on the understanding of the dialog history and the image. The occurrences of coreference relations in the dialog makes…

Computer Vision and Pattern Recognition · Computer Science 2022-03-08 Mingxiao Li, Marie-Francine Moens

Object-Centric Diagnosis of Visual Reasoning

When answering questions about an image, it not only needs knowing what -- understanding the fine-grained contents (e.g., objects, relationships) in the image, but also telling why -- reasoning over grounding visual cues to derive the…

Computer Vision and Pattern Recognition · Computer Science 2020-12-22 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many…

Computation and Language · Computer Science 2023-11-01 Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen

A Better Loss for Visual-Textual Grounding

Given a textual phrase and an image, the visual grounding problem is the task of locating the content of the image referenced by the sentence. It is a challenging task that has several real-world applications in human-computer interaction,…

Computer Vision and Pattern Recognition · Computer Science 2022-02-03 Davide Rigoni, Luciano Serafini, Alessandro Sperduti

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one…

Computation and Language · Computer Science 2020-02-10 Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari

Joint Visual Grounding with Language Scene Graphs

Visual grounding is a task to ground referring expressions in images, e.g., localize "the white truck in front of the yellow one". To resolve this task fundamentally, the model should first find out the contextual objects (e.g., the…

Computer Vision and Pattern Recognition · Computer Science 2020-04-13 Daqing Liu, Hanwang Zhang, Zheng-Jun Zha, Meng Wang, Qianru Sun

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. One promising and scalable strategy for learning visual grounding is to utilize…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Yongfei Liu, Bo Wan, Lin Ma, Xuming He

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding, for open-vocabulary semantic segmentation. Plenty of the previous art casts this task as pixel-to-text classification…

Computer Vision and Pattern Recognition · Computer Science 2024-08-12 Dahyun Kang, Minsu Cho

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities…

Computer Vision and Pattern Recognition · Computer Science 2023-07-24 Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

Grounded Semantic Composition for Visual Scenes

We present a visually-grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on the combination of individual word meanings to produce meanings for complex…

Artificial Intelligence · Computer Science 2011-07-04 P. Gorniak, D. Roy