Related papers: Improved Visual Grounding through Self-Consistent …

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth…

Computer Vision and Pattern Recognition · Computer Science 2017-02-21 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. One promising and scalable strategy for learning visual grounding is to utilize…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Yongfei Liu, Bo Wan, Lin Ma, Xuming He

Learning Unsupervised Visual Grounding Through Semantic Self-Supervision

Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this…

Computer Vision and Pattern Recognition · Computer Science 2018-11-20 Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase…

Computer Vision and Pattern Recognition · Computer Science 2019-10-16 Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez

Contrastive Learning for Weakly Supervised Phrase Grounding

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Grounding Visual Explanations (Extended Abstract)

Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, a…

Computer Vision and Pattern Recognition · Computer Science 2017-11-20 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu

Improving Visually Grounded Sentence Representations with Self-Attention

Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the qualities of sentence representations by jointly training them with…

Computation and Language · Computer Science 2017-12-05 Kang Min Yoo, Youhyun Shin, Sang-goo Lee

Visual Semantic Reasoning for Image-Text Matching

Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because the current representation of image usually lacks global semantic concepts as in its corresponding text caption. To…

Computer Vision and Pattern Recognition · Computer Science 2019-09-09 Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

Learning to Generate Grounded Visual Captions without Localization Supervision

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

Learning Cross-modal Context Graph for Visual Grounding

Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic…

Computer Vision and Pattern Recognition · Computer Science 2019-11-26 Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally…

Computer Vision and Pattern Recognition · Computer Science 2022-11-18 Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions. Compared to the supervised approach, learning is more difficult since bounding boxes…

Computer Vision and Pattern Recognition · Computer Science 2023-09-27 Davide Rigoni, Luca Parolari, Luciano Serafini, Alessandro Sperduti, Lamberto Ballan

Learning Visual Grounding from Generative Vision and Language Model

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

Fine-Grained Grounding for Multimodal Speech Recognition

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering…

Computation and Language · Computer Science 2020-10-07 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Grounding Visual Explanations

Existing visual explanation generating agents learn to fluently justify a class prediction. However, they may mention visual attributes which reflect a strong class prior, although the evidence may not actually be in the image. This is…

Computer Vision and Pattern Recognition · Computer Science 2018-08-03 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata