Related papers: Curriculum Learning for Data-Efficient Vision-Lang…

200 papers

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-25 · Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-17 · Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-07 · Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming a foundation for various downstream tasks. However, relying on one-to-one (image, text)…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen, Qingwen Liu, Yanfeng Wang

We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts with a text encoder, SuperClass directly utilizes tokenized raw text as…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-07 · Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, Haoqi Fan

Image-caption pretraining has been quite successfully used for downstream vision tasks like zero-shot image classification and object detection. However, image-caption pretraining is still a hard problem -- it requires multiple concepts…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-31 · Hammad A. Ayyubi, Rahul Lokesh, Alireza Zareian, Bo Wu, Shih-Fu Chang

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle with fine-grained entities that are rare, or even absent from…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-22 · Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-09 · Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan

Currently, the most dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, such as CLIP and its variants. In this work, we question whether such a costly…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-05 · Jingfeng Yang, Ziyang Wu, Yue Zhao, Yi Ma

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation.…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-26 · Le Zhang, Rabiul Awal, Aishwarya Agrawal

Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the situation when one image is associated with several captions, each caption containing both…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-02 · Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-07 · Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is…

Machine Learning · Computer Science · 2025-07-08 · Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

Few-shot learning aims to train and optimize a model that can adapt to unseen visual classes with only a few labeled examples. Existing few-shot learning (FSL) methods rely heavily on visual data alone and thus fail to capture the semantic…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-21 · Mohamed Afham, Ranga Rodrigo

Vision-language pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-29 · Fuxiao Liu, Hao Tan, Chris Tensmeyer

Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-02 · Hiroshi Sasaki

Existing vision-text contrastive learning methods like CLIP aim to match paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-20 · Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

Visual recognition has recently been learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with web-crawled image-text pairs. While supervised learning may result in a more…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-08 · Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

State-of-the-art empirical work has shown that visual representations learned by deep neural networks are robust in nature and capable of performing classification tasks on diverse datasets. For example, CLIP demonstrated zero-shot transfer…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-14 · Chanda Grover, Indra Deep Mastan, Debayan Gupta

Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has…

Computation and Language · Computer Science · 2023-10-26 · Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen