Related papers: CLAP: Learning Transferable Binary Code Representa…

SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-21 Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun Nagaraja, Yang Liu, Yangyang Shi, Vikas Chandra

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from…

Robotics · Computer Science 2026-01-08 Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

CLAP: Learning Audio Concepts From Natural Language Supervision

Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled…

Sound · Computer Science 2022-06-13 Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is…

Computer Vision and Pattern Recognition · Computer Science 2022-03-14 Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao

SLIP: Self-supervision meets Language-Image Pre-training

Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Norman Mu, Alexander Kirillov, David Wagner, Saining Xie

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

On the Language Encoder of Contrastive Cross-modal Models

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component…

Computation and Language · Computer Science 2023-10-23 Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

SuperCLIP: CLIP with Simple Classification Supervision

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

Non-Contrastive Learning Meets Language-Image Pre-Training

Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei

Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs).…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that…

Machine Learning · Computer Science 2024-07-12 Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li

CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models

Continual learning (CL) aims to help deep neural networks learn new knowledge while retaining what has been learned. Owing to their powerful generalizability, pre-trained vision-language models such as Contrastive Language-Image…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 Saurav Jha, Dong Gong, Lina Yao

Contrastive Language-Image Pre-training for the Italian Language

CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on…

Computation and Language · Computer Science 2021-08-20 Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-22 Zaid Khan, Yun Fu

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for…

Computer Vision and Pattern Recognition · Computer Science 2024-11-06 Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

Exploring Visual Interpretability for Contrastive Language-Image Pre-training

Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available supervision of natural language. It improves the performance of downstream vision tasks, including but not limited to the zero-shot, long tail,…

Computer Vision and Pattern Recognition · Computer Science 2022-11-29 Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li

Simple Contrastive Representation Adversarial Learning for NLP Tasks

Self-supervised learning approaches such as contrastive learning have attracted great attention in natural language processing. Contrastive learning uses pairs of training data augmentations to build a classification task for an encoder with well representation…

Computation and Language · Computer Science 2021-12-03 Deshui Miao, Jiaqi Zhang, Wenbo Xie, Jian Song, Xin Li, Lijuan Jia, Ning Guo

ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks

Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities…

Sound · Computer Science 2024-06-12 Xin Jing, Andreas Triantafyllopoulos, Björn Schuller