Related papers: Curriculum Learning for Data-Efficient Vision-Lang…

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

SuperCLIP: CLIP with Simple Classification Supervision

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming a foundation for various downstream tasks. However, relying on one-to-one (image, text)…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang

Classification Done Right for Vision-Language Pre-Training

We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts with a text encoder, SuperClass directly utilizes tokenized raw text as…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

Learning from Children: Improving Image-Caption Pretraining via Curriculum

Image-caption pretraining has been quite successfully used for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem -- it requires multiple concepts…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Hammad A. Ayyubi, Rahul Lokesh, Alireza Zareian, Bo Wu, Shih-Fu Chang

Retrieval-Enhanced Contrastive Vision-Text Models

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

TULIP: Towards Unified Language-Image Pretraining

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2025-04-09 Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan

Language-Image Alignment with Fixed Text Encoders

Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly…

Computer Vision and Pattern Recognition · Computer Science 2025-06-05 Jingfeng Yang, Ziyang Wu, Yue Zhao, Yi Ma

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Le Zhang, Rabiul Awal, Aishwarya Agrawal

Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the situation when one image is associated with several captions, each caption containing both…

Computer Vision and Pattern Recognition · Computer Science 2024-08-02 Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke

Learning Visual Composition through Improved Semantic Guidance

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

Finetuning CLIP to Reason about Pairwise Differences

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science 2025-07-08 Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

Visual-Semantic Contrastive Alignment for Few-Shot Image Classification

Few-shot learning aims to train and optimize a model that can adapt to unseen visual classes with only a few labeled examples. Existing few-shot learning (FSL) methods rely heavily on visual data alone and thus fail to capture the semantic…

Computer Vision and Pattern Recognition · Computer Science 2022-10-21 Mohamed Afham, Ranga Rodrigo

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Vision-language pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding…

Computer Vision and Pattern Recognition · Computer Science 2024-04-29 Fuxiao Liu, Hao Tan, Chris Tensmeyer

Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-02 Hiroshi Sasaki

MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

Unified Contrastive Learning in Image-Text-Label Space

Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with webly-crawled image-text pairs. While supervised learning may result in a more…

Computer Vision and Pattern Recognition · Computer Science 2022-04-08 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations

State-of-the-art empirical work has shown that visual representations learned by deep neural networks are robust in nature and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer…

Computer Vision and Pattern Recognition · Computer Science 2023-03-14 Chanda Grover, Indra Deep Mastan, Debayan Gupta

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has…

Computation and Language · Computer Science 2023-10-26 Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen