Related papers: DiffCLIP: Few-shot Language-driven Multimodal Clas…

DiffCLIP: Differential Attention Meets CLIP

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Hasan Abed Al Kader Hammoud , Bernard Ghanem

Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia , Andrei Codreanu , Victor Sim , Vighnesh Prabhu , Michael Lu , Kevin Zhu , Sean O'Brien , Vasu Sharma

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Sangwoo Mo , Minkyu Kim , Kyungmin Lee , Jinwoo Shin

DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Sitian Shen , Zilin Zhu , Linqian Fan , Harry Zhang , Xinxiao Wu

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 João Daniel Silva , Joao Magalhaes , Devis Tuia , Bruno Martins

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Yuqi Lin , Minghao Chen , Kaipeng Zhang , Hengjia Li , Mingming Li , Zheng Yang , Dongqin Lv , Binbin Lin , Haifeng Liu , Deng Cai

RegionCLIP: Region-based Language-Image Pretraining

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize…

Computer Vision and Pattern Recognition · Computer Science 2021-12-17 Yiwu Zhong , Jianwei Yang , Pengchuan Zhang , Chunyuan Li , Noel Codella , Liunian Harold Li , Luowei Zhou , Xiyang Dai , Lu Yuan , Yin Li , Jianfeng Gao

Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning

Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor…

Computer Vision and Pattern Recognition · Computer Science 2025-10-29 Ivica Dimitrovski , Vlatko Spasev , Ivan Kitanovski

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang , Zhenbang Wu , Dinesh Agarwal , Jimeng Sun

SuperCLIP: CLIP with Simple Classification Supervision

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao , Zilong Huang , Jiashi Feng , Xinggang Wang

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

The learning objective of vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian , Shixing Chen , Raffay Hamid

DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhiwei Yang , Pengfei Song , Yucong Meng , Kexue Fu , Shuo Wang , Zhijian Song

Multi-label Cluster Discrimination for Visual Representation Learning

Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Xiang An , Kaicheng Yang , Xiangzi Dai , Ziyong Feng , Jiankang Deng

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Weiquan Huang , Aoqi Wu , Yifan Yang , Xufang Luo , Yuqing Yang , Usman Naseem , Chunyu Wang , Qi Dai , Xiyang Dai , Dongdong Chen , Chong Luo , Lili Qiu , Liang Hu

Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal , Harini S

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Jishnu Jaykumar P , Kamalesh Palanisamy , Yu-Wei Chao , Xinya Du , Yu Xiang

Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Dongseob Kim , Hyunjung Shim

Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View

Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances. However, in many specialized fields, it is struggling to obtain sufficient alignment data for training, which…

Machine Learning · Computer Science 2024-09-24 Zijia Song , Zelin Zang , Yelin Wang , Guozheng Yang , Kaicheng Yu , Wanyu Chen , Miaoyu Wang , Stan Z. Li

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu , Ke Zhang , Lu Xia , Albert Chen , Jiajia Luo , Yuyin Sun , Ken Wang , Nan Qiao , Xiao Zeng , Min Sun , Cheng-Hao Kuo , Ram Nevatia

CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation

Multi-label classification is an essential task utilized in a wide variety of real-world applications. Multi-label zero-shot learning is a method for classifying images into multiple unseen categories for which no training data is…

Computer Vision and Pattern Recognition · Computer Science 2024-06-24 Muhammad Ali , Salman Khan