Related papers: Conditional Prompt Learning for Vision-Language Mo…

200 papers

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based…

Computer Vision and Pattern Recognition · Computer Science 2022-10-07 Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

Large pre-trained vision-language models like CLIP have transformed computer vision by aligning images and text in a shared feature space, enabling robust zero-shot transfer via prompting. Soft-prompting, such as Context Optimization…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Arsham Gholamzadeh Khoee, Yinan Yu, Robert Feldt

Prompt tuning is an effective way to adapt the pre-trained visual-language model (VLM) to the downstream task using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Hantao Yao, Rui Zhang, Changsheng Xu

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Kun Ding, Xiaohui Li, Qiang Yu, Ying Wang, Haojian Zhang, Shiming Xiang

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang

Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Zhaolong Wang, Tongfeng Sun, Mingzheng Du, Yachao Huang

Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been…

Computer Vision and Pattern Recognition · Computer Science 2023-02-15 Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, Changsheng Xu

Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains…

Computer Vision and Pattern Recognition · Computer Science 2025-03-13 Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Gahyeon Kim, Sohee Kim, Seokju Lee

Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by…

Computer Vision and Pattern Recognition · Computer Science 2023-12-13 Koustava Goswami, Srikrishna Karanam, Prateksha Udhayanan, K J Joseph, Balaji Vasan Srinivasan

The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, Zhihai He

Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Enming Zhang, Bingke Zhu, Yingying Chen, Qinghai Miao, Ming Tang, Jinqiao Wang

Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Min Zhang, Bo Jiang, Jie Zhou, Yimeng Liu, Xin Lin

So far, efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan

Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity for transfer learning. Recently, learnable prompts have achieved state-of-the-art performance, but they are prone to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng

Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Thi Minh Anh Pham, An Duc Nguyen, Cephas Svosve, Vasileios Argyriou, Georgios Tzimiropoulos

Prompt learning is one of the most effective and trending ways to adapt powerful vision-language foundation models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, although prompt learning…

Computer Vision and Pattern Recognition · Computer Science 2023-04-03 Cairong Zhao, Yubin Wang, Xinyang Jiang, Yifei Shen, Kaitao Song, Dongsheng Li, Duoqian Miao

We propose Consistency-guided Prompt learning (CoPrompt), a new fine-tuning method for vision-language models. Our approach improves the generalization of large foundation models when fine-tuned on downstream tasks in a few-shot setting.…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Shuvendu Roy, Ali Etemad

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao