Related papers: Attribute-based Visual Reprogramming for Vision-La…

Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input…

Machine Learning · Computer Science 2025-06-03 Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes…

Computer Vision and Pattern Recognition · Computer Science 2023-08-11 Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition

Continual learning of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in…

Computer Vision and Pattern Recognition · Computer Science 2025-03-26 Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao

SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models

Vision-language models such as CLIP are pretrained on large volumes of internet sourced image and text pairs, and have been shown to sometimes exhibit impressive zero- and low-shot image classification performance. However, due to their…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Omiros Pantazis, Gabriel Brostow, Kate Jones, Oisin Mac Aodha

AttriPrompt: Dynamic Prompt Composition Learning for CLIP

The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on contrastive…

Computer Vision and Pattern Recognition · Computer Science 2025-09-09 Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang

Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Xiao Wang, Jiandong Jin, Chenglong Li, Jin Tang, Cheng Zhang, Wei Wang

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition,…

Computer Vision and Pattern Recognition · Computer Science 2024-10-03 William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang

Controlling Vision-Language Models for Multi-Task Image Restoration

Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön

Multi-modal Attribute Prompting for Vision-Language Models

Pre-trained Vision-Language Models (VLMs), like CLIP, exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques primarily focus on global text and image representations, yet…

Computer Vision and Pattern Recognition · Computer Science 2024-07-12 Xin Liu, Jiamin Wu, Wenfei Yang, Xu Zhou, Tianzhu Zhang

Understanding and Improving Visual Prompting: A Label-Mapping Perspective

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Aochuan Chen, Yuguang Yao, Pin-Yu Chen, Yihua Zhang, Sijia Liu

Text Descriptions are Compressive and Invariant Representations for Visual Learning

Modern image classification is based upon directly predicting classes via large discriminative networks, which do not directly contain information about the intuitive visual features that may constitute a classification decision. Recently,…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Zhili Feng, Anna Bair, J. Zico Kolter

DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Leqi Shen, Guoqiang Gong, Tianxiang Hao, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Jungong Han, Guiguang Ding

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia

FairCLIP: Social Bias Elimination based on Attribute Prototype Learning and Representation Neutralization

The Vision-Language Pre-training (VLP) models like CLIP have gained popularity in recent years. However, many works found that the social biases hidden in CLIP easily manifest in downstream tasks, especially in image retrieval, which can…

Computer Vision and Pattern Recognition · Computer Science 2024-05-31 Junyang Wang, Yi Zhang, Jitao Sang

Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation

Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training (CLIP) by effectively facilitating knowledge transfer. However, these adaptation methods are usually operated on the global view of an input…

Computer Vision and Pattern Recognition · Computer Science 2024-07-22 Jinda Lu, Shuo Wang, Yanbin Hao, Haifeng Liu, Xiang Wang, Meng Wang

Sample-specific Masks for Visual Reprogramming-based Prompting

Visual reprogramming (VR) is a prompting technique that aims to re-purpose a pre-trained model (e.g., a classifier on ImageNet) to target tasks (e.g., medical data prediction) by learning a small-scale pattern added into input images…

Machine Learning · Computer Science 2024-06-06 Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Yifan Wang, Tao Wang, Chenwei Tang, Caiyang Yu, Zhengqing Zang, Mengmi Zhang, Shudong Huang, Jiancheng Lv

Bayesian-guided Label Mapping for Visual Reprogramming

Visual reprogramming (VR) leverages the intrinsic capabilities of pretrained vision models by adapting their input or output interfaces to solve downstream tasks whose labels (i.e., downstream labels) might be totally different from the…

Machine Learning · Computer Science 2024-11-01 Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu

Finetuning CLIP to Reason about Pairwise Differences

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science 2025-07-08 Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter