Related papers: CLIP-Decoder : ZeroShot Multilabel Classification …

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that…

Machine Learning · Computer Science 2024-07-12 Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

Multimodal Multilabel Classification by CLIP

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm to handle two data sources, the image and text, and learn a comprehensive semantic feature representation across the modalities. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-06-25 Yanming Guo

DiffCLIP: Few-shot Language-driven Multimodal Classifier

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Dongseob Kim, Hyunjung Shim

CLIP-driven Zero-shot Learning with Ambiguous Labels

Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Jinfu Fan, Jiangnan Li, Xiaowen Yan, Xiaohui Zhong, Wenpeng Lu, Linqing Huang

CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification

This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-08 Rabab Abdelfattah, Qing Guo, Xiaoguang Li, Xiaofeng Wang, Song Wang

The Solution for Language-Enhanced Image New Category Discovery

Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang

Multi-label Cluster Discrimination for Visual Representation Learning

Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jiankang Deng

Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma

Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space, then retrieving the class closest to the image. This work provides a new…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Fawaz Sammani, Nikos Deligiannis

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs. It shows impressive performance on downstream tasks by zero-shot knowledge…

Computer Vision and Pattern Recognition · Computer Science 2022-07-21 Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels

Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-17 Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Haoxian Ruan, Zhihua Xu, Zhijing Yang, Yongyi Lu, Jinghui Qin, Tianshui Chen

Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e.,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-11 Cristina Menghini, Andrew Delworth, Stephen H. Bach

Masked Unsupervised Self-training for Label-free Image Classification

State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved…

Computer Vision and Pattern Recognition · Computer Science 2023-03-13 Junnan Li, Silvio Savarese, Steven C. H. Hoi

Multimodal CLIP Inference for Meta-Few-Shot Image Classification

In recent literature, few-shot classification has predominantly been defined by the N-way k-shot meta-learning problem. Models designed for this purpose are usually trained to excel on standard benchmarks following a restricted setup,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Constance Ferragu, Philomene Chagniot, Vincent Coyette

Transductive Multi-label Zero-shot Learning

Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate…

Machine Learning · Computer Science 2015-03-27 Yanwei Fu, Yongxin Yang, Tim Hospedales, Tao Xiang, Shaogang Gong

Domain Aligned CLIP for Few-shot Classification

Large vision-language representation learning models like CLIP have demonstrated impressive performance for zero-shot transfer to downstream tasks while largely benefiting from inter-modal (image-text) alignment via contrastive objectives.…

Computer Vision and Pattern Recognition · Computer Science 2023-11-16 Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl

Augmenting Zero-Shot Detection Training with Image Labels

Zero-shot detection (ZSD), i.e., detection on classes not seen during training, is essential for real world detection use-cases, but remains a difficult task. Recent research attempts ZSD with detection models that output embeddings instead…

Computer Vision and Pattern Recognition · Computer Science 2023-06-13 Katharina Kornmeier, Ulla Scheler, Pascal Herrmann