Related papers: Patch-CLIP: A Patch-Text Pre-Trained Model

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-24 · Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, Yazhou Yao

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-22 · Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu

FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs ($>77$ tokens). To remedy this…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-30 · Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-06 · Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Language-Image Pre-training (CLIP) has drawn increasing attention recently for its transferable visual representation learning. However, due to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-11 · Longtian Qiu, Renrui Zhang, Ziyu Guo, Ziyao Zeng, Zilu Guo, Yafeng Li, Guangnan Zhang

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-07 · Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment

Recent approaches have shown that large-scale vision-language models such as CLIP can improve semantic segmentation performance. These methods typically aim for pixel-level vision-language alignment, but often rely on low resolution image…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-01 · Anurag Das, Xinting Hu, Li Jiang, Bernt Schiele

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-25 · Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-26 · Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that…

Machine Learning · Computer Science · 2024-07-12 · Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-10 · Yuanxiang Huangfu, Chaochao Wang, Weilei Wang

PLIP: Language-Image Pre-training for Person Representation Learning

Language-image pre-training is an effective technique for learning powerful representations in general domains. However, when directly turning to person representation learning, these general pre-training methods suffer from unsatisfactory…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-30 · Jialong Zuo, Jiahao Hong, Feng Zhang, Changqian Yu, Hanyu Zhou, Changxin Gao, Nong Sang, Jingdong Wang

Automated Description Generation for Software Patches

Software patches are pivotal in refining and evolving codebases, addressing bugs, vulnerabilities, and optimizations. Patch descriptions provide detailed accounts of changes, aiding comprehension and collaboration among developers. However,…

Software Engineering · Computer Science · 2024-09-30 · Thanh Trong Vu, Tuan-Dung Bui, Thanh-Dat Do, Thu-Trang Nguyen, Hieu Dinh Vo, Son Nguyen

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill representation…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-11 · Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-02 · Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-23 · Bang Yang, Tong Zhang, Yuexian Zou

AttriPrompt: Dynamic Prompt Composition Learning for CLIP

The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on contrastive…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-09 · Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

In this paper, we introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of contrastive learning-based vision-language models, particularly CLIP, in handling detail-oriented and fine-grained tasks like segmentation. While…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-02 · Amin Karimi Monsefi, Kishore Prakash Sailaja, Ali Alilooee, Ser-Nam Lim, Rajiv Ramnath

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-14 · Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

Recently, the contrastive language-image pre-training, e.g., CLIP, has demonstrated promising results on various downstream tasks. The pre-trained model can capture enriched visual concepts for images by learning from a large scale of…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-21 · Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, Tianrui Li