
Related papers: Relation-aware Instance Refinement for Weakly Supe…

200 papers

Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions. Compared to the supervised approach, learning is more difficult since bounding boxes…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-27 · Davide Rigoni, Luca Parolari, Luciano Serafini, Alessandro Sperduti, Lamberto Ballan

Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications such as image-text inference or text-driven multimedia interaction. Most of the current existing methods adopt the…

Computer Vision and Pattern Recognition · Computer Science · 2018-05-03 · Zhiyuan Fang, Shu Kong, Tianshu Yu, Yezhou Yang

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase…

Computer Vision and Pattern Recognition · Computer Science · 2019-10-16 · Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth…

Computer Vision and Pattern Recognition · Computer Science · 2017-02-21 · Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-30 · Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases, in the form of spatial attention masks. Specifically, the model is trained with…

Computer Vision and Pattern Recognition · Computer Science · 2017-05-04 · Fanyi Xiao, Leonid Sigal, Yong Jae Lee

Query-based video grounding is an important yet challenging task in video understanding, which aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works achieve significant progress by…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-09 · Shentong Mo, Daizong Liu, Wei Hu

3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-19 · Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbates this difficulty. In this…

Computer Vision and Pattern Recognition · Computer Science · 2018-11-20 · Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into…

Computation and Language · Computer Science · 2023-10-20 · Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on…

Computer Vision and Pattern Recognition · Computer Science · 2020-08-07 · Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Weakly supervised referring expression grounding aims at localizing the referential object in an image according to the linguistic query, where the mapping between the referential object and query is unknown in the training stage. To…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-29 · Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Dechao Meng, Qingming Huang

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-27 · Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

Weakly supervised visual grounding aims to predict the region in an image that corresponds to a specific linguistic query, where the mapping between the target object and query is unknown in the training stage. The state-of-the-art method…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-23 · Viet-Quoc Pham, Nao Mishima

The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges:…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-06 · Xiaoqi Li, Jiaming Liu, Nuowei Han, Liang Heng, Yandong Guo, Hao Dong, Yang Liu

We study weakly-supervised video object grounding: given a video segment and a corresponding descriptive sentence, the goal is to localize objects that are mentioned from the sentence in the video. During training, no object bounding boxes…

Computer Vision and Pattern Recognition · Computer Science · 2018-07-23 · Luowei Zhou, Nathan Louis, Jason J. Corso

Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task…

Computer Vision and Pattern Recognition · Computer Science · 2018-03-30 · Raymond A. Yeh, Minh N. Do, Alexander G. Schwing

We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by…

Computer Vision and Pattern Recognition · Computer Science · 2016-09-15 · Vadim Kantorov, Maxime Oquab, Minsu Cho, Ivan Laptev

Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of…

Computer Vision and Pattern Recognition · Computer Science · 2021-09-07 · Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization ("grounding") abilities of these…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-08 · Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez