Related papers: CLIP-Decoder : ZeroShot Multilabel Classification …

200 papers

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that…

Machine Learning · Computer Science 2024-07-12 Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest.…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Wei Li, Linchao Zhu, Longyin Wen, Yi Yang

Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm to handle two data sources, the image and text, and learn a comprehensive semantic feature representation across the modalities. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-06-25 Yanming Guo

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized…

Computer Vision and Pattern Recognition · Computer Science 2024-12-11 Jiaqing Zhang, Mingxiang Cao, Xue Yang, Kai Jiang, Yunsong Li

Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Dongseob Kim, Hyunjung Shim

Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-06 Jinfu Fan, Jiangnan Li, Xiaowen Yan, Xiaohui Zhong, Wenpeng Lu, Linqing Huang

This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-08 Rabab Abdelfattah, Qing Guo, Xiaoguang Li, Xiaofeng Wang, Song Wang

Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Haonan Xu, Dian Chao, Xiangyu Wu, Zhonghua Wan, Yang Yang

Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jiankang Deng

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma

Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space, then retrieving the class closest to the image. This work provides a new…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Fawaz Sammani, Nikos Deligiannis

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs. It shows impressive performance on downstream tasks by zero-shot knowledge…

Computer Vision and Pattern Recognition · Computer Science 2022-07-21 Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

Large-scale 2D vision-language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-17 Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Haoxian Ruan, Zhihua Xu, Zhijing Yang, Yongyi Lu, Jinghui Qin, Tianshui Chen

Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e.,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-11 Cristina Menghini, Andrew Delworth, Stephen H. Bach

State-of-the-art computer vision models are mostly trained with supervised learning using human-labeled images, which limits their scalability due to the expensive annotation cost. While self-supervised representation learning has achieved…

Computer Vision and Pattern Recognition · Computer Science 2023-03-13 Junnan Li, Silvio Savarese, Steven C. H. Hoi

In recent literature, few-shot classification has predominantly been defined by the N-way k-shot meta-learning problem. Models designed for this purpose are usually trained to excel on standard benchmarks following a restricted setup,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Constance Ferragu, Philomene Chagniot, Vincent Coyette

Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate…

Machine Learning · Computer Science 2015-03-27 Yanwei Fu, Yongxin Yang, Tim Hospedales, Tao Xiang, Shaogang Gong

Large vision-language representation learning models like CLIP have demonstrated impressive performance for zero-shot transfer to downstream tasks while largely benefiting from inter-modal (image-text) alignment via contrastive objectives.…

Computer Vision and Pattern Recognition · Computer Science 2023-11-16 Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl

Zero-shot detection (ZSD), i.e., detection on classes not seen during training, is essential for real world detection use-cases, but remains a difficult task. Recent research attempts ZSD with detection models that output embeddings instead…

Computer Vision and Pattern Recognition · Computer Science 2023-06-13 Katharina Kornmeier, Ulla Scheler, Pascal Herrmann