Related papers: ComAlign: Compositional Alignment in Vision-Langua…

200 papers

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Le Zhang, Rabiul Awal, Aishwarya Agrawal

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim

Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance…

Computer Vision and Pattern Recognition · Computer Science 2024-03-04 Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

Vision-Language Models (VLMs) have shown remarkable capabilities in a large number of downstream tasks. Nonetheless, compositional image understanding remains a rather difficult task due to the object bias present in training data. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Matteo Nulli, Anesa Ibrahimi, Avik Pal, Hoshe Lee, Ivona Najdenkoska

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations…

Computer Vision and Pattern Recognition · Computer Science 2026-04-17 Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

Vision-language models like CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions because of their training focus on short and concise captions. We present…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Hyungyu Choi, Young Kyun Jang, Chanho Eom

Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in…

Computer Vision and Pattern Recognition · Computer Science 2025-02-21 Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune

In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared…

We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary.…

Machine Learning · Computer Science 2024-01-12 Matthew Trager, Pramuditha Perera, Luca Zancato, Alessandro Achille, Parminder Bhatia, Stefano Soatto

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between…

Machine Learning · Computer Science 2025-10-29 Amit Peleg, Naman Deep Singh, Matthias Hein

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag…

Computer Vision and Pattern Recognition · Computer Science 2025-05-12 Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Timothy Ossowski, Ming Jiang, Junjie Hu

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science 2025-07-08 Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, Muhao Chen