Related papers: Aligning Visual and Lexical Semantics

Building a visual semantics aware object hierarchy

The semantic gap is defined as the difference between the linguistic representations of the same concept, which usually leads to misunderstanding between individuals with different knowledge backgrounds. Since linguistically annotated…

Computer Vision and Pattern Recognition · Computer Science 2022-03-01 Xiaolei Diao

Towards Visual Semantics

Lexical Semantics is concerned with how words encode mental representations of the world, i.e., concepts. We call this type of concepts, classification concepts. In this paper, we focus on Visual Semantics, namely on how humans build…

Artificial Intelligence · Computer Science 2021-09-15 Fausto Giunchiglia, Luca Erculiani, Andrea Passerini

What can Computer Vision learn from Ranganathan?

The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-02 Mayukh Bagchi, Fausto Giunchiglia

Vision-to-Language Tasks Based on Attributes and Attention Mechanism

Vision-to-language tasks aim to integrate computer vision and natural language processing together, which has attracted the attention of many researchers. For typical approaches, they encode image into feature representations and decode it…

Computer Vision and Pattern Recognition · Computer Science 2019-05-30 Xuelong Li, Aihong Yuan, Xiaoqiang Lu

Context Matters: Learning Global Semantics via Object-Centric Representation

Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-10 Jike Zhong, Yuxiang Lai, Xiaofeng Yang, Konstantinos Psounis

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen

SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark…

Artificial Intelligence · Computer Science 2025-08-26 Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson

VISaGE: Understanding Visual Generics and Exceptions

While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm…

Computation and Language · Computer Science 2025-10-15 Stella Frank, Emily Allaway

Do Vision and Language Encoders Represent the World Similarly?

Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor

Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing

Visual Question Answering (VQA) systems are tasked with answering natural language questions corresponding to a presented image. Traditional VQA datasets typically contain questions related to the spatial information of objects, object…

Computation and Language · Computer Science 2020-06-05 Goonmeet Bajaj, Bortik Bandyopadhyay, Daniel Schmidt, Pranav Maneriker, Christopher Myers, Srinivasan Parthasarathy

Learning to Model Multimodal Semantic Alignment for Story Visualization

Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story, where the images should be realistic and keep global consistency across dynamic scenes and characters. Current works face the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Bowen Li, Thomas Lukasiewicz

Visual Semantic Information Pursuit: A Survey

Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former one is a visual perception task…

Computer Vision and Pattern Recognition · Computer Science 2019-03-14 Daqi Liu, Miroslaw Bober, Josef Kittler

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we…

Computer Vision and Pattern Recognition · Computer Science 2019-08-07 Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the distributional semantics but fail to connect to any knowledge about the physical world. In contrast,…

Computation and Language · Computer Science 2021-11-16 Yizhen Zhang, Minkyu Choi, Kuan Han, Zhongming Liu

Not just a matter of semantics: the relationship between visual similarity and semantic similarity

Knowledge transfer, zero-shot learning and semantic image retrieval are methods that aim at improving accuracy by utilizing semantic information, e.g. from WordNet. It is assumed that this information can augment or replace missing visual…

Computer Vision and Pattern Recognition · Computer Science 2019-06-03 Clemens-Alexander Brust, Joachim Denzler

Causal Graphical Models for Vision-Language Compositional Understanding

Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on…

Computer Vision and Pattern Recognition · Computer Science 2025-04-16 Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara

Bridging the Gap between Local Semantic Concepts and Bag of Visual Words for Natural Scene Image Retrieval

This paper addresses the problem of semantic-based image retrieval of natural scenes. A typical content-based image retrieval system deals with the query image and images in the dataset as a collection of low-level features and retrieves a…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Yousef Alqasrawi

SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a…

Computation and Language · Computer Science 2025-10-16 Sifan Li, Yujun Cai, Yiwei Wang

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science 2017-11-10 Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Representations in vision and language converge in a shared, multidimensional space of perceived similarities

Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge. Emerging evidence suggests that human brain representations in both vision and…

Neurons and Cognition · Quantitative Biology 2025-07-30 Katerina Marie Simkova, Adrien Doerig, Clayton Hickey, Ian Charest