
Related papers: I0T: Embedding Standardization Method Towards Zero…

200 papers

Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder…

Computer Vision and Pattern Recognition · Computer Science 2024-06-10 Abrar Fahim, Alex Murphy, Alona Fyshe

Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language…

Computer Vision and Pattern Recognition · Computer Science 2024-01-05 Longtian Qiu, Shan Ning, Xuming He

Contrastive Language-Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Sedigheh Eslami, Gerard de Melo

Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space, then retrieving the class closest to the image. This work provides a new…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Fawaz Sammani, Nikos Deligiannis

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful…

Computer Vision and Pattern Recognition · Computer Science 2025-02-07 Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive…

Machine Learning · Computer Science 2026-02-16 Can Yaras, Siyi Chen, Peng Wang, Qing Qu

Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of…

Computer Vision and Pattern Recognition · Computer Science 2023-05-26 Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across…

Computer Vision and Pattern Recognition · Computer Science 2025-07-21 Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Yingrui Ji, Xi Xiao, Gaofei Chen, Hao Xu, Chenrui Ma, Lijing Zhu, Aokun Liang, Jiansheng Chen

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone…

Computer Vision and Pattern Recognition · Computer Science 2025-02-20 Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces…

Computer Vision and Pattern Recognition · Computer Science 2025-04-18 Shin'ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, Daiki Chijiwa

Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream…

Computer Vision and Pattern Recognition · Computer Science 2023-08-30 Seongha Eom, Namgyu Ho, Jaehoon Oh, Se-Young Yun

Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Behnam Raoufi, Hossein Sharify, Mohamad Mahdee Ramezanee, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Christoph Timmermann, Hyunse Lee, Woojin Lee

Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Yayuan Li, Jintao Guo, Lei Qi, Wenbin Li, Yinghuan Shi

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images…

Computer Vision and Pattern Recognition · Computer Science 2025-06-02 Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen