Related papers: CLAP: Learning Transferable Binary Code Representa…

200 papers

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations.…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-01-21 · Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun Nagaraja, Yang Liu, Yangyang Shi, Vikas Chandra

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from…

Robotics · Computer Science · 2026-01-08 · Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-07 · Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled…

Sound · Computer Science · 2022-06-13 · Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

Contrastive Language-Image Pretraining (CLIP) has emerged as a novel paradigm to learn visual models from language supervision. While researchers continue to push the frontier of CLIP, reproducing these works remains challenging. This is…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-14 · Yufeng Cui, Lichen Zhao, Feng Liang, Yangguang Li, Jing Shao

Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-24 · Norman Mu, Alexander Kirillov, David Wagner, Saining Xie

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-25 · Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component…

Computation and Language · Computer Science · 2023-10-23 · Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-17 · Weiheng Zhao, Zilong Huang, Jiashi Feng, Xinggang Wang

Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-18 · Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs).…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-29 · Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer

Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve the model performance. Recently, CLIP has emerged as an effective approach that…

Machine Learning · Computer Science · 2024-07-12 · Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-17 · Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li

Continual learning (CL) aims to help deep neural networks learn new knowledge while retaining what has been learned. Owing to their powerful generalizability, pre-trained vision-language models such as Contrastive Language-Image…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-01 · Saurav Jha, Dong Gong, Lina Yao

CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal model that jointly learns representations of images and texts. The model is trained on a massive amount of English data and shows impressive performance on…

Computation and Language · Computer Science · 2021-08-20 · Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni, Gabriele Sarti, Sri Lakshmi

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-22 · Zaid Khan, Yun Fu

Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-06 · Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang

Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available supervision of natural language. It improves the performance of downstream vision tasks, including but not limited to the zero-shot, long tail,…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-29 · Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li

Self-supervised learning approaches such as contrastive learning have attracted great attention in natural language processing. Contrastive learning uses pairs of training data augmentations to build a classification task for an encoder with well representation…

Computation and Language · Computer Science · 2021-12-03 · Deshui Miao, Jiaqi Zhang, Wenbo Xie, Jian Song, Xin Li, Lijuan Jia, Ning Guo

Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities…

Sound · Computer Science · 2024-06-12 · Xin Jing, Andreas Triantafyllopoulos, Björn Schuller