Related papers: Modest-Align: Data-Efficient Alignment for Vision-…

SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger

During the preceding biennium, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a…

Computer Vision and Pattern Recognition · Computer Science 2023-12-19 Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Wei Liu, Jie Yang, Ke Li, Xing Sun

Robust Cross-Modal Representation Learning with Progressive Self-Distillation

The learning objective of vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets, which contributes to its compute and data inefficiency. To…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Alex Andonian, Shixing Chen, Raffay Hamid

MirrorAlign: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning

Word alignment is essential for the downstream cross-lingual language understanding and generation tasks. Recently, the performance of the neural word alignment models has exceeded that of statistical models. However, they heavily rely on…

Computation and Language · Computer Science 2022-05-11 Di Wu, Liang Ding, Shuo Yang, Mingyang Li

Set-CLIP: Exploring Aligned Semantic From Low-Alignment Multimodal Data Through A Distribution View

Multimodal fusion breaks through the boundaries between diverse modalities and has already achieved notable performances. However, in many specialized fields, it is struggling to obtain sufficient alignment data for training, which…

Machine Learning · Computer Science 2024-09-24 Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Kaicheng Yu, Wanyu Chen, Miaoyu Wang, Stan Z. Li

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

While large scale pre-training has achieved great achievements in bridging the gap between vision and language, it still faces several challenges. First, the cost for pre-training is expensive. Second, there is no efficient way to handle…

Computation and Language · Computer Science 2021-09-23 Jue Wang, Haofan Wang, Jincan Deng, Weijia Wu, Debing Zhang

Model alignment using inter-modal bridges

Foundation models have demonstrated remarkable performance across modalities such as language and vision. However, model reuse across distinct modalities (e.g., text and vision) remains limited due to the difficulty of aligning internal…

Machine Learning · Computer Science 2025-05-20 Ali Gholamzadeh, Noor Sajid

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most of the existing CLIP-alike works usually adopt relatively large image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe Wang

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver…

Computer Vision and Pattern Recognition · Computer Science 2025-03-24 Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, Eui-Nam Huh, Choong Seon Hong

Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

Contrastive Language-Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Sedigheh Eslami, Gerard de Melo

Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive…

Computer Vision and Pattern Recognition · Computer Science 2024-11-05 Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan

TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while…

Machine Learning · Computer Science 2025-07-01 Yuanze Hu, Zhaoxin Fan, Xinyu Wang, Gen Li, Ye Qiu, Zhichao Yang, Wenjun Wu, Kejian Wu, Yifan Sun, Xiaotie Deng, Jin Dong

ComAlign: Compositional Alignment in Vision-Language Models

Vision-language models (VLMs) like CLIP have showcased a remarkable ability to extract transferable features for downstream tasks. Nonetheless, the training process of these models is usually based on a coarse-grained contrastive loss…

Computer Vision and Pattern Recognition · Computer Science 2024-09-13 Ali Abdollah, Amirmohammad Izadi, Armin Saghafian, Reza Vahidimajd, Mohammad Mozafari, Amirreza Mirzaei, Mohammadmahdi Samiei, Mahdieh Soleymani Baghshah

Multi-level Cross-modal Alignment for Image Clustering

Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai

Craft: Cross-modal Aligned Features Improve Robustness of Prompt Tuning

Prompt Tuning has emerged as a prominent research paradigm for adapting vision-language models to various downstream tasks. However, recent research indicates that prompt tuning methods often lead to overfitting due to limited training…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Jingchen Sun, Rohan Sharma, Vishnu Suresh Lokhande, Changyou Chen

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang

uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee

Enhancing CLIP Robustness via Cross-Modality Alignment

Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang

Linear Alignment of Vision-language Models for Image Captioning

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler

Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful…

Computer Vision and Pattern Recognition · Computer Science 2025-02-07 Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov