Related papers: CtrlSynth: Controllable Image Text Synthesis for D…

RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains…

Computer Vision and Pattern Recognition · Computer Science 2025-08-06 Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach

The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, while existing synthetic data generation methods primarily focus on…

Computer Vision and Pattern Recognition · Computer Science 2025-11-10 Yuanxiang Huangfu, Chaochao Wang, Weilei Wang

ComCLIP: Training-Free Compositional Image and Text Matching

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis

Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to…

Computer Vision and Pattern Recognition · Computer Science 2021-07-07 Sanket Biswas, Pau Riba, Josep Lladós, Umapada Pal

Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Daniel Csizmadia, Andrei Codreanu, Victor Sim, Vighnesh Prabhu, Michael Lu, Kevin Zhu, Sean O'Brien, Vasu Sharma

Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Paired image-text data with subtle variations in-between (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing such training…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Haoxin Li, Boyang Li

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Junhyeok Choi, Sangwoo Mo, Minwoo Chae

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell

Can Synthetic Images Serve as Effective and Efficient Class Prototypes?

Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning…

Computer Vision and Pattern Recognition · Computer Science 2026-01-22 Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang

CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low…

Computer Vision and Pattern Recognition · Computer Science 2025-09-26 Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian, Shixing Chen, Raffay Hamid

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet…

Computer Vision and Pattern Recognition · Computer Science 2024-11-27 Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, Cihang Xie

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data, such as cross-modal retrieval. The adaptation of CLIP to this specific…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 João Daniel Silva, Joao Magalhaes, Devis Tuia, Bruno Martins

CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

Pretrained vision-language models (VLMs) such as CLIP excel in general multimodal comprehension but often struggle to capture nuanced, context-dependent visual cues. This makes it difficult to distinguish between similar-looking concepts…

Computer Vision and Pattern Recognition · Computer Science 2025-07-17 Yuchen Huang, Zhiyuan Fan, Zhitao He, Sandeep Polisetty, Wenyan Li, Yi R. Fung

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan

SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios

Prompt learning is a powerful technique for transferring Vision-Language Models (VLMs) such as CLIP to downstream tasks. However, the prompt-based methods that are fine-tuned solely with base classes may struggle to generalize to novel…

Computer Vision and Pattern Recognition · Computer Science 2023-12-08 Mushui Liu, Weijie He, Ziqian Lu, Yunlong Yu

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel

StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis

Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image…

Computer Vision and Pattern Recognition · Computer Science 2022-03-02 Peter Schaldenbrand, Zhixuan Liu, Jean Oh

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, relying on one-to-one (image, text)…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang