Related papers: Phrase Localization Without Paired Training Exampl…

Adapting CLIP For Phrase Localization Without Further Training

Supervised or weakly supervised methods for phrase localization (textual grounding) either rely on human annotations or some other supervised models, e.g., object detectors. Obtaining these annotations is labor-intensive and may be…

Computer Vision and Pattern Recognition · Computer Science 2022-04-08 Jiahao Li, Greg Shakhnarovich, Raymond A. Yeh

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain…

Computer Vision and Pattern Recognition · Computer Science 2017-08-10 Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, Svetlana Lazebnik

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Phrase localization is a task that studies the mapping from textual phrases to regions of an image. Given difficulties in annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) to leverage more…

Computation and Language · Computer Science 2020-10-13 Qinxin Wang, Hao Tan, Sheng Shen, Michael W. Mahoney, Zhewei Yao

Read, look and detect: Bounding box annotation from image-caption pairs

Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data…

Computer Vision and Pattern Recognition · Computer Science 2023-06-13 Eduardo Hugo Sanchez

Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks. Specifically, the model is trained with…

Computer Vision and Pattern Recognition · Computer Science 2017-05-04 Fanyi Xiao, Leonid Sigal, Yong Jae Lee

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase…

Computer Vision and Pattern Recognition · Computer Science 2019-10-16 Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth…

Computer Vision and Pattern Recognition · Computer Science 2017-02-21 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Contrastive Learning for Weakly Supervised Phrase Grounding

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Revisiting Image-Language Networks for Open-ended Phrase Detection

Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task…

Computer Vision and Pattern Recognition · Computer Science 2020-10-14 Bryan A. Plummer, Kevin J. Shih, Yichen Li, Ke Xu, Svetlana Lazebnik, Stan Sclaroff, Kate Saenko

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images…

Computer Vision and Pattern Recognition · Computer Science 2019-08-27 Iro Laina, Christian Rupprecht, Nassir Navab

Weakly Supervised Attention Learning for Textual Phrases Grounding

Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications such as image-text inference or text-driven multimedia interaction. Most of the current existing methods adopt the…

Computer Vision and Pattern Recognition · Computer Science 2018-05-03 Zhiyuan Fang, Shu Kong, Tianshu Yu, Yezhou Yang

Object-Centric Unsupervised Image Captioning

Image captioning is a longstanding problem in the field of computer vision and natural language processing. To date, researchers have produced impressive state-of-the-art performance in the age of deep learning. Most of these…

Computer Vision and Pattern Recognition · Computer Science 2022-07-20 Zihang Meng, David Yang, Xuefei Cao, Ashish Shah, Ser-Nam Lim

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not…

Computer Vision and Pattern Recognition · Computer Science 2022-06-28 Tal Shaharabany, Yoad Tewel, Lior Wolf

Towards localisation of keywords in speech using weak supervision

Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak…

Computation and Language · Computer Science 2020-12-15 Kayode Olaleye, Benjamin van Niekerk, Herman Kamper

Self-Supervised Feature Learning for Long-Term Metric Visual Localization

Visual localization is the task of estimating camera pose in a known scene, which is an essential problem in robotics and computer vision. However, long-term visual localization is still a challenge due to the environmental appearance…

Robotics · Computer Science 2022-12-02 Yuxuan Chen, Timothy D. Barfoot

Attention-Based Keyword Localisation in Speech using Visual Grounding

Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model…

Computation and Language · Computer Science 2021-06-24 Kayode Olaleye, Herman Kamper

Learning to Read by Spelling: Towards Unsupervised Text Recognition

This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images, with…

Computer Vision and Pattern Recognition · Computer Science 2018-12-11 Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

Semi Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework

We introduce a novel deep neural network architecture that links visual regions to corresponding textual segments including phrases and words. To accomplish this task, our architecture makes use of the rich semantic information available in…

Computer Vision and Pattern Recognition · Computer Science 2019-08-09 Deepan Das, Noor Mohammed Ghouse, Shashank Verma, Yin Li

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Weakly supervised phrase grounding aims at learning region-phrase correspondences using only image-sentence pairs. A major challenge thus lies in the missing links between image regions and sentence phrases during training. To address this…

Computer Vision and Pattern Recognition · Computer Science 2021-04-27 Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu

Learning to Generate Grounded Visual Captions without Localization Supervision

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira