Related papers: Data-Efficient Language-Supervised Zero-Shot Learn…

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions to visual concepts…

Computer Vision and Pattern Recognition · Computer Science 2023-12-19 Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Peter Vajda, Joseph E. Gonzalez

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

The learning objective of vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian, Shixing Chen, Raffay Hamid

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP

Image-text contrastive models like CLIP have wide applications in zero-shot classification, image-text retrieval, and transfer learning. However, they often struggle on compositional visio-linguistic tasks (e.g., attribute-binding or…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Samyadeep Basu, Shell Xu Hu, Maziar Sanjabi, Daniela Massiceti, Soheil Feizi

Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Kaicheng Yang, Tiancheng Gu, Xiang An, Haiqiang Jiang, Xiangzi Dai, Ziyong Feng, Weidong Cai, Jiankang Deng

CLIP-KD: An Empirical Study of CLIP Model Distillation

Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation…

Computer Vision and Pattern Recognition · Computer Science 2024-05-08 Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu

SILC: Improving Vision Language Pretraining with Self-Distillation

Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for…

Computer Vision and Pattern Recognition · Computer Science 2023-12-08 Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari

CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers

Contrastive Language-Image Pre-training (CLIP) has been shown to improve zero-shot generalization capabilities of language and vision models. In this paper, we extend CLIP for efficient knowledge distillation, by utilizing embeddings as…

Machine Learning · Computer Science 2024-09-02 Lakshmi Nair

Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data

Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Niclas Popp, Jan Hendrik Metzen, Matthias Hein

Soft-Label Dataset Distillation and Text Dataset Distillation

Dataset distillation is a method for reducing dataset sizes by learning a small number of synthetic samples containing all the information of a large dataset. This has several benefits like speeding up model training, reducing energy…

Machine Learning · Computer Science 2022-06-10 Ilia Sucholutsky, Matthias Schonlau

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training

Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

Online Zero-Shot Classification with CLIP

Vision-language pre-training such as CLIP enables zero-shot transfer that can classify images according to the candidate class names. While CLIP demonstrates an impressive zero-shot performance on diverse downstream tasks, the distribution…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Qi Qian, Juhua Hu

Enhancing CLIP Conceptual Embedding through Knowledge Distillation

Recently, CLIP has become an important model for aligning images and text in multi-modal contexts. However, researchers have identified limitations in the ability of CLIP's text and image encoders to extract detailed knowledge from pairs of…

Artificial Intelligence · Computer Science 2024-12-10 Kuei-Chun Kao