Related papers: DiffCLIP: Few-shot Language-driven Multimodal Clas…

200 papers

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Hasan Abed Al Kader Hammoud, Bernard Ghanem

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma

Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, Xinxiao Wu

Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize…

Computer Vision and Pattern Recognition · Computer Science 2021-12-17 Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor…

Computer Vision and Pattern Recognition · Computer Science 2025-10-29 Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian, Shixing Chen, Raffay Hamid

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Zhiwei Yang, Pengfei Song, Yucong Meng, Kexue Fu, Shuo Wang, Zhijian Song

Contrastive Language Image Pre-training (CLIP) has recently demonstrated success across various tasks due to superior feature representation empowered by image-text contrastive learning. However, the instance discrimination method used by…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jiankang Deng

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu

Photo search, the task of retrieving images based on textual queries, has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model. CLIP leverages a vision-language pre-training…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Naresh Kumar Lahajal, Harini S

We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP which utilizes image prototypes and…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Jishnu Jaykumar P, Kamalesh Palanisamy, Yu-Wei Chao, Xinya Du, Yu Xiang

Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Dongseob Kim, Hyunjung Shim

Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performance. However, in many specialized fields, it is difficult to obtain sufficient alignment data for training, which…

Machine Learning · Computer Science 2024-09-24 Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li

Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia

Multi-label classification is an essential task utilized in a wide variety of real-world applications. Multi-label zero-shot learning is a method for classifying images into multiple unseen categories for which no training data is…

Computer Vision and Pattern Recognition · Computer Science 2024-06-24 Muhammad Ali, Salman Khan