Related papers: Learning Customized Visual Models with Retrieval-A…

200 papers

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Ahmet Iscen , Mathilde Caron , Alireza Fathi , Cordelia Schmid

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta , Maxwell Horton , Fartash Faghri , Mohammad Hossein Sekhavat , Mahyar Najibi , Mehrdad Farajtabar , Oncel Tuzel , Mohammad Rastegari

Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference one and integrate the modifications expressed by the caption. Given that recent…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Alberto Baldrati , Marco Bertini , Tiberio Uricchio , Alberto del Bimbo

Contrastive Language-Image Pretraining (CLIP) models excel at understanding image-text relationships but struggle with adapting to new data without forgetting prior knowledge. To address this, models are typically fine-tuned using both new…

Machine Learning · Computer Science 2026-05-06 Ryan King , Gang Li , Bobak Mortazavi , Tianbao Yang

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal , Harini S

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable generalization ability of their learned features. However, the features they learn often…

Computer Vision and Pattern Recognition · Computer Science 2025-04-24 Yichao Cai , Yuhang Liu , Zhen Zhang , Javen Qinfeng Shi

The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian , Shixing Chen , Raffay Hamid

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Lijie Fan , Dilip Krishnan , Phillip Isola , Dina Katabi , Yonglong Tian

State-of-the-art empirical work has shown that visual representations learned by deep neural networks are robust in nature and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer…

Computer Vision and Pattern Recognition · Computer Science 2023-03-14 Chanda Grover , Indra Deep Mastan , Debayan Gupta

Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu , Ke Zhang , Lu Xia , Albert Chen , Jiajia Luo , Yuyin Sun , Ken Wang , Nan Qiao , Xiao Zeng , Min Sun , Cheng-Hao Kuo , Ram Nevatia

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini , Pietro Astolfi , Adriana Romero-Soriano , Jakob Verbeek , Michal Drozdzal

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related…

Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Hiroshi Sasaki

Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP…

Computer Vision and Pattern Recognition · Computer Science 2024-01-12 Cheng Cheng , Lin Song , Ruoyi Xue , Hang Wang , Hongbin Sun , Yixiao Ge , Ying Shan

Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods usually operate on the global view of an input…

Computer Vision and Pattern Recognition · Computer Science 2024-07-22 Jinda Lu , Shuo Wang , Yanbin Hao , Haifeng Liu , Xiang Wang , Meng Wang

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi , Arnav Jain , Ali Payani , Baharan Mirzasoleiman

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are…

Computer Vision and Pattern Recognition · Computer Science 2022-05-04 Sanjay Subramanian , William Merrill , Trevor Darrell , Matt Gardner , Sameer Singh , Anna Rohrbach

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders its precision in distinguishing subtle…

Computer Vision and Pattern Recognition · Computer Science 2026-05-18 Ziyu Liu , Zeyi Sun , Yuhang Zang , Wei Li , Pan Zhang , Xiaoyi Dong , Yuanjun Xiong , Dahua Lin , Jiaqi Wang

Existing computer vision research on artwork struggles with recognizing artwork's fine-grained attributes and with a lack of curated annotated datasets, which are costly to create. To the best of our knowledge, we are one of the first methods to…

Computer Vision and Pattern Recognition · Computer Science 2022-05-02 Marcos V. Conde , Kerem Turgutlu

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao , Zilong Huang , Jiashi Feng , Xinggang Wang