Related papers: Learning Visually Grounded Sentence Representation…

Improving Visually Grounded Sentence Representations with Self-Attention

Sentence representation models trained only on language could potentially suffer from the grounding problem. Recent work has shown promising results in improving the qualities of sentence representations by jointly training them with…

Computation and Language · Computer Science 2017-12-05 Kang Min Yoo, Youhyun Shin, Sang-goo Lee

Image Captioning with Visual Object Representations Grounded in the Textual Modality

We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual…

Computer Vision and Pattern Recognition · Computer Science 2020-10-21 Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura

Learning semantic sentence representations from visually grounded language without lexical knowledge

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word…

Computation and Language · Computer Science 2019-09-25 Danny Merkx, Stefan Frank

Learning to Generate Grounded Visual Captions without Localization Supervision

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

Contrastive Learning for Weakly Supervised Phrase Grounding

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many…

Computation and Language · Computer Science 2023-11-01 Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one…

Computation and Language · Computer Science 2020-02-10 Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari

Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Melika Behjati, James Henderson

Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings

Semantic representation learning for sentences is an important and well-studied problem in NLP. The current trend for this task involves training a Transformer-based sentence encoder through a contrastive objective with text, i.e.,…

Computation and Language · Computer Science 2022-09-21 Yiren Jian, Chongyang Gao, Soroush Vosoughi

Visually Grounded Compound PCFGs

Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings.…

Computation and Language · Computer Science 2020-12-08 Yanpeng Zhao, Ivan Titov

Semantic Sentence Embeddings for Paraphrasing and Text Summarization

This paper introduces a sentence to vector encoding framework suitable for advanced natural language processing. Our latent representation is shown to encode sentences with common semantic information with similar vector representations.…

Computation and Language · Computer Science 2018-09-30 Chi Zhang, Shagan Sah, Thang Nguyen, Dheeraj Peri, Alexander Loui, Carl Salvaggio, Raymond Ptucha

Conditional Image-Text Embedding Networks

This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose…

Computer Vision and Pattern Recognition · Computer Science 2018-07-31 Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik

Language learning using Speech to Image retrieval

Humans learn language by interaction with their environment and listening to other humans. It should also be possible for computational models to learn language directly from speech but so far most approaches require text. We improve on…

Computation and Language · Computer Science 2019-09-25 Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Contextual Grounding of Natural Language Entities in Images

In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve the grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token…

Computer Vision and Pattern Recognition · Computer Science 2019-11-07 Farley Lai, Ning Xie, Derek Doran, Asim Kadav

Learning to Recognise Words using Visually Grounded Speech

We investigated word recognition in a Visually Grounded Speech model. The model has been trained on pairs of images and spoken captions to create visually grounded embeddings which can be used for speech to image retrieval and vice versa.…

Computation and Language · Computer Science 2020-06-02 Sebastiaan Scholten, Danny Merkx, Odette Scharenborg

Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation

In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English. This is considered as the multimodal image caption…

Computation and Language · Computer Science 2018-06-01 Jean-Benoit Delbrouck, Stéphane Dupont, Omar Seddati

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images…

Computer Vision and Pattern Recognition · Computer Science 2019-08-27 Iro Laina, Christian Rupprecht, Nassir Navab

A sequential guiding network with attention for image captioning

The recent advances of deep learning in both computer vision (CV) and natural language processing (NLP) provide us a new way of understanding semantics, by which we can deal with more challenging tasks such as automatic description…

Computer Vision and Pattern Recognition · Computer Science 2019-02-12 Daouda Sow, Zengchang Qin, Mouhamed Niasse, Tao Wan

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders

How does visual information included in training affect language processing in audio- and text-based deep learning models? We explore how such visual grounding affects model-internal representations of words, and find substantially…

Computation and Language · Computer Science 2025-09-22 Adrian Sauter, Willem Zuidema, Marianne de Heer Kloots

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as…

Computer Vision and Pattern Recognition · Computer Science 2019-05-31 Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, Shih-Fu Chang