Related papers: Improved Visual Grounding through Self-Consistent …

200 papers

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth…

Computer Vision and Pattern Recognition · Computer Science 2017-02-21 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. One promising and scalable strategy for learning visual grounding is to utilize…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Yongfei Liu, Bo Wan, Lin Ma, Xuming He

Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbates this difficulty. In this…

Computer Vision and Pattern Recognition · Computer Science 2018-11-20 Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase…

Computer Vision and Pattern Recognition · Computer Science 2019-10-16 Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this…

Computer Vision and Pattern Recognition · Computer Science 2024-01-09 Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image. In this paper, a…

Computer Vision and Pattern Recognition · Computer Science 2017-11-20 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred to in captions. The effectiveness of grounding representation learning heavily relies on the scale…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu

Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the qualities of sentence representations by jointly training them with…

Computation and Language · Computer Science 2017-12-05 Kang Min Yoo, Youhyun Shin, Sang-goo Lee

Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because the current representation of image usually lacks global semantic concepts as in its corresponding text caption. To…

Computer Vision and Pattern Recognition · Computer Science 2019-09-09 Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic…

Computer Vision and Pattern Recognition · Computer Science 2019-11-26 Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Visual grounding, i.e., localizing objects in images according to natural language queries, is an important topic in visual language understanding. The most effective approaches for this task are based on deep learning, which generally…

Computer Vision and Pattern Recognition · Computer Science 2022-11-18 Haojun Jiang, Yuanze Lin, Dongchen Han, Shiji Song, Gao Huang

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions. Compared to the supervised approach, learning is more difficult since bounding boxes…

Computer Vision and Pattern Recognition · Computer Science 2023-09-27 Davide Rigoni, Luca Parolari, Luciano Serafini, Alessandro Sperduti, Lamberto Ballan

Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering…

Computation and Language · Computer Science 2020-10-07 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Existing visual explanation generating agents learn to fluently justify a class prediction. However, they may mention visual attributes which reflect a strong class prior, although the evidence may not actually be in the image. This is…

Computer Vision and Pattern Recognition · Computer Science 2018-08-03 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata