Related papers: ComAlign: Compositional Alignment in Vision-Langua…

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Le Zhang, Rabiul Awal, Aishwarya Agrawal

ComCLIP: Training-Free Compositional Image and Text Matching

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim

CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

In-Context Learning Improves Compositional Understanding of Vision-Language Models

Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Matteo Nulli, Anesa Ibrahimi, Avik Pal, Hoshe Lee, Ivona Najdenkoska

Cross-Modal Concept Learning and Inference for Vision-Language Models

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations…

Computer Vision and Pattern Recognition · Computer Science 2026-04-17 Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

Learning Visual Composition through Improved Semantic Guidance

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

GOAL: Global-local Object Alignment Learning

Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Hyungyu Choi, Young Kyun Jang, Chanho Eom

Object-centric Binding in Contrastive Language-Image Pretraining

Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

Adding simple structure at inference improves Vision-Language Compositionality

Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

Semantic Compositions Enhance Vision-Language Contrastive Learning

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared…

Computation and Language · Computer Science 2025-11-04 Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar

Linear Spaces of Meanings: Compositional Structures in Vision-Language Models

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary.…

Machine Learning · Computer Science 2024-01-12 Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille, Parminder Bhatia, Stefano Soatto

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between…

Machine Learning · Computer Science 2025-10-29 Amit Peleg, Naman Deep Singh, Matthias Hein

VladVA: Discriminative Fine-tuning of LVLMs

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag…

Computer Vision and Pattern Recognition · Computer Science 2025-05-12 Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

Prompting Large Vision-Language Models for Compositional Reasoning

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Timothy Ossowski, Ming Jiang, Junjie Hu

Finetuning CLIP to Reason about Pairwise Differences

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science 2025-07-08 Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen