Related papers: Learning Customized Visual Models with Retrieval-A…

Retrieval-Enhanced Contrastive Vision-Text Models

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one that integrate the modifications expressed by the caption. Given that recent…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo

Memory-Efficient Continual Learning with CLIP Models

Contrastive Language-Image Pretraining (CLIP) models excel at understanding image-text relationships but struggle with adapting to new data without forgetting prior knowledge. To address this, models are typically fine-tuned using both new…

Machine Learning · Computer Science 2026-05-06 Ryan King, Gang Li, Bobak Mortazavi, Tianbao Yang

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal, Harini S

CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable ability of the learned features for generalization. However, the features they learned often…

Computer Vision and Pattern Recognition · Computer Science 2025-04-24 Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian, Shixing Chen, Raffay Hamid

Improving CLIP Training with Language Rewrites

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian

ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations

State-of-the-art empirical work has shown that visual representations learned by deep neural networks are robust in nature and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer…

Computer Vision and Pattern Recognition · Computer Science 2023-03-14 Chanda Grover, Indra Deep Mastan, Debayan Gupta

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related…

Computation and Language · Computer Science 2024-06-27 Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao

Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Hiroshi Sasaki

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

The contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts, enabling effective zero-shot image recognition. Nevertheless, few-shot learning methods based on CLIP…

Computer Vision and Pattern Recognition · Computer Science 2024-01-12 Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods are usually operated on the global view of an input…

Computer Vision and Pattern Recognition · Computer Science 2024-07-22 Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are…

Computer Vision and Pattern Recognition · Computer Science 2022-05-04 Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle…

Computer Vision and Pattern Recognition · Computer Science 2026-05-18 Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification

Existing computer vision research on artwork struggles with recognizing artwork's fine-grained attributes and with the lack of curated annotated datasets, which are costly to create. To the best of our knowledge, we are one of the first methods to…

Computer Vision and Pattern Recognition · Computer Science 2022-05-02 Marcos V. Conde, Kerem Turgutlu

SuperCLIP: CLIP with Simple Classification Supervision

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang