Related papers: Demystifying CLIP Data

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

Meta CLIP 2: A Worldwide Scaling Recipe

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting from zero-shot classification, retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is…

Computer Vision and Pattern Recognition · Computer Science 2022-03-14 Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao

Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise…

Computer Vision and Pattern Recognition · Computer Science 2025-02-10 Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi

Improving CLIP Training with Language Rewrites

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

MoDE: CLIP Data Experts via Clustering

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal, Harini S

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regards…

Computer Vision and Pattern Recognition · Computer Science 2024-04-17 Zichao Li, Cihang Xie, Ekin Dogus Cubuk

CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification

Existing computer vision research in artwork struggles with artwork's fine-grained attributes recognition and lack of curated annotated datasets due to their costly creation. To the best of our knowledge, we are one of the first methods to…

Computer Vision and Pattern Recognition · Computer Science 2022-05-02 Marcos V. Conde, Kerem Turgutlu

CLIP in Medical Imaging: A Survey

Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training paradigm, successfully introduces text supervision to vision models. It has shown promising results across various tasks due to its generalizability and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Zihao Zhao, Yuxiao Liu, Han Wu, Mei Wang, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Zhiming Cui, Qian Wang, Dinggang Shen

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Non-Contrastive Learning Meets Language-Image Pre-Training

Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

SuperCLIP: CLIP with Simple Classification Supervision

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

ComCLIP: Training-Free Compositional Image and Text Matching

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

Training CLIP models on Data from Scientific Papers

Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with…

Computer Vision and Pattern Recognition · Computer Science 2023-11-09 Calvin Metzger

Contrastive Language-Image Pre-training for the Italian Language

CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on…

Computation and Language · Computer Science 2021-08-20 Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi