Related papers: I0T: Embedding Standardization Method Towards Zero…

It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder…

Computer Vision and Pattern Recognition · Computer Science 2024-06-10 Abrar Fahim, Alex Murphy, Alona Fyshe

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language…

Computer Vision and Pattern Recognition · Computer Science 2024-01-05 Longtian Qiu, Shan Ning, Xuming He

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

Contrastive Language-Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Sedigheh Eslami, Gerard de Melo

Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space, then retrieving the class closest to the image. This work provides a new…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Fawaz Sammani, Nikos Deligiannis

Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful…

Computer Vision and Pattern Recognition · Computer Science 2025-02-07 Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive…

Machine Learning · Computer Science 2026-02-16 Can Yaras, Siyi Chen, Peng Wang, Qing Qu

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung

Improving Zero-shot Generalization and Robustness of Multi-modal Models

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of…

Computer Vision and Pattern Recognition · Computer Science 2023-05-26 Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across…

Computer Vision and Pattern Recognition · Computer Science 2025-07-21 Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis

CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Yingrui Ji, Xi Xiao, Gaofei Chen, Hao Xu, Chenrui Ma, Lijing Zhu, Aokun Liang, Jiansheng Chen

Contrastive Localized Language-Image Pre-Training

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone…

Computer Vision and Pattern Recognition · Computer Science 2025-02-20 Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces…

Computer Vision and Pattern Recognition · Computer Science 2025-04-18 Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa

Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval

Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Seongha Eom, Namgyu Ho, Jaehoon Oh, Se-Young Yun

CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Christoph Timmermann, Hyunse Lee, Woojin Lee

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Yayuan Li, Jintao Guo, Lei Qi, Wenbin Li, Yinghuan Shi

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen