Related papers: Efficient Multi-Modal Embeddings from Structured D…

Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge

Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks and have even been shown to capture cognitive aspects of word meaning. The majority of these models are purely text based,…

Computation and Language · Computer Science 2022-03-31 Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science 2017-11-10 Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations

Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models. While different embeddings exhibit different applicability and performance on downstream tasks, little is known…

Computation and Language · Computer Science 2023-06-06 Aleksey Tikhonov, Lisa Bylinina, Denis Paperno

Visual-Semantic Embedding Model Informed by Structured Knowledge

We propose a novel approach to improve a visual-semantic embedding model by incorporating concept representations captured from an external structured knowledge base. We investigate its performance on image classification under both…

Computer Vision and Pattern Recognition · Computer Science 2020-09-22 Mirantha Jayathilaka, Tingting Mu, Uli Sattler

Learning semantic sentence representations from visually grounded language without lexical knowledge

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word…

Computation and Language · Computer Science 2019-09-25 Danny Merkx, Stefan Frank

Learning Structured Semantic Embeddings for Visual Recognition

Numerous embedding models have been recently explored to incorporate semantic knowledge into visual recognition. Existing methods typically focus on minimizing the distance between the corresponding images and texts in the embedding space…

Computer Vision and Pattern Recognition · Computer Science 2017-06-06 Dong Li, Hsin-Ying Lee, Jia-Bin Huang, Shengjin Wang, Ming-Hsuan Yang

Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training

Language grounding aims at linking the symbolic representation of language (e.g., words) into the rich perceptual knowledge of the outside world. The general approach is to embed both textual and visual information into a common space, the…

Computation and Language · Computer Science 2021-09-15 Hassan Shahmohammadi, Hendrik P. A. Lensch, R. Harald Baayen

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic…

Machine Learning · Computer Science 2021-02-23 Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes

Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

Despite significant progress in multimodal language models (LMs), it remains unclear whether visual grounding enhances their understanding of embodied knowledge compared to text-only models. To address this question, we propose a novel…

Computation and Language · Computer Science 2025-10-21 Zhihui Yang, Yupei Wang, Kaijie Mo, Zhe Zhao, Renfen Hu

Visual Grounding of Inter-lingual Word-Embeddings

Visual grounding of language aims at enriching textual representations of language with multiple sources of visual knowledge such as images and videos. Although visual grounding is an area of intense research, inter-lingual aspects of…

Computation and Language · Computer Science 2022-11-22 Wafaa Mohammed, Hassan Shahmohammadi, Hendrik P. A. Lensch, R. Harald Baayen

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one…

Computation and Language · Computer Science 2020-02-10 Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari

Multi-modal Visual Understanding with Prompts for Semantic Information Disentanglement of Image

Multi-modal visual understanding of images with prompts involves using various visual and textual cues to enhance the semantic understanding of images. This approach combines both vision and language processing to generate more accurate…

Computer Vision and Pattern Recognition · Computer Science 2023-05-17 Yuzhou Peng

A Multimodal Visual Encoding Model Aided by Introducing Verbal Semantic Information

Biological research has revealed that the verbal semantic information in the brain cortex, as an additional source, participates in nonverbal semantic tasks, such as visual encoding. However, previous visual encoding models did not…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Shuxiao Ma, Linyuan Wang, Bin Yan

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning. However, their annotation process is labor-intensive and needs expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings,…

Computer Vision and Pattern Recognition · Computer Science 2023-05-29 Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata

Using Sparse Semantic Embeddings Learned from Multimodal Text and Image Data to Model Human Conceptual Knowledge

Distributional models provide a convenient way to model semantics using dense embedding spaces derived from unsupervised learning algorithms. However, the dimensions of dense embedding spaces are not designed to resemble human semantic…

Computation and Language · Computer Science 2018-11-15 Steven Derby, Paul Miller, Brian Murphy, Barry Devereux

Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies

This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. To be more specific, we investigate whether incorporating multimodal content improves performance…

Computation and Language · Computer Science 2024-12-09 Recep Firat Cekinel, Pinar Karagoz, Cagri Coltekin

Image Captioning with Visual Object Representations Grounded in the Textual Modality

We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual…

Computer Vision and Pattern Recognition · Computer Science 2020-10-21 Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura

Multimodal Embeddings from Language Models

Word embeddings such as ELMo have recently been shown to model word semantics with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant improvement in the state of the art across many…

Computation and Language · Computer Science 2019-09-11 Shao-Yen Tseng, Panayiotis Georgiou, Shrikanth Narayanan

An Analysis of Semantically-Aligned Speech-Text Embeddings

Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their…

Computation and Language · Computer Science 2023-01-20 Muhammad Huzaifah, Ivan Kukanov

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many…

Computation and Language · Computer Science 2023-11-01 Hassan Shahmohammadi, Maria Heitmeier, Elnaz Shafaei-Bajestan, Hendrik P. A. Lensch, Harald Baayen