Related papers: CAPro: Webly Supervised Learning with Cross-Modali…

MoPro: Webly Supervised Learning with Momentum Prototypes

We propose a webly-supervised representation learning method that does not suffer from the annotation unscalability of supervised learning, nor the computation unscalability of self-supervised learning. Most existing works on…

Computer Vision and Pattern Recognition · Computer Science 2020-09-18 Junnan Li, Caiming Xiong, Steven C. H. Hoi

Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang

Learning Adaptive Cross-Embodiment Visuomotor Policy with Contrastive Prompt Orchestration

Learning adaptive visuomotor policies for embodied agents remains a formidable challenge, particularly when facing cross-embodiment variations such as diverse sensor configurations and dynamic properties. Conventional learning approaches…

Robotics · Computer Science 2026-02-03 Yuhang Zhang, Chao Yan, Jiaxi Yu, Jiaping Xiao, Mir Feroskhan

Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning

Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-09-28 Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh

Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents

Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou

Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos

The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Rohit Gupta, Anirban Roy, Claire Christensen, Sujeong Kim, Sarah Gerard, Madeline Cincebeaux, Ajay Divakaran, Todd Grindal, Mubarak Shah

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

Learning Visual Composition through Improved Semantic Guidance

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

Learning from Noisy Web Data with Category-level Supervision

As tons of photos are being uploaded to public websites (e.g., Flickr, Bing, and Google) every day, learning from web data has become an increasingly popular research direction because of freely available web resources, which is also…

Computer Vision and Pattern Recognition · Computer Science 2018-05-25 Li Niu, Qingtao Tang, Ashok Veeraraghavan, Ashu Sabharwal

Semantic Contrastive Bootstrapping for Single-positive Multi-label Recognition

Learning multi-label image recognition with incomplete annotation is gaining popularity due to its superior performance and significant labor savings when compared to training with fully labeled datasets. Existing literature mainly focuses…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Cheng Chen, Yifan Zhao, Jia Li

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Existing pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text…

Computation and Language · Computer Science 2022-03-15 Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang

Towards Effective Visual Representations for Partial-Label Learning

Under partial-label learning (PLL) where, for each training instance, only a set of ambiguous candidate labels containing the unknown true label is accessible, contrastive learning has recently boosted the performance of PLL on vision…

Computer Vision and Pattern Recognition · Computer Science 2023-05-11 Shiyu Xia, Jiaqi Lv, Ning Xu, Gang Niu, Xin Geng

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-25 Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari

Transductive CLIP with Class-Conditional Contrastive Learning

Inspired by the remarkable zero-shot generalization capacity of vision-language pre-trained model, we seek to leverage the supervision from CLIP model to alleviate the burden of data labeling. However, such supervision inevitably contains…

Computer Vision and Pattern Recognition · Computer Science 2022-06-14 Junchu Huang, Weijie Chen, Shicai Yang, Di Xie, Shiliang Pu, Yueting Zhuang

Contrastive Learning Improves Model Robustness Under Label Noise

Deep neural network-based classifiers trained with the categorical cross-entropy (CCE) loss are sensitive to label noise in the training data. One common type of method that can mitigate the impact of label noise can be viewed as supervised…

Computer Vision and Pattern Recognition · Computer Science 2021-04-20 Aritra Ghosh, Andrew Lan

Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data

We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In…

Computer Vision and Pattern Recognition · Computer Science 2023-01-27 Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon

PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Zhuoyao Liu, Yang Liu, Wentao Feng, Shudong Huang

BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency

As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and correctly aligned data pairs are required for model…

Computer Vision and Pattern Recognition · Computer Science 2023-06-09 Shuo Yang, Zhaopan Xu, Kai Wang, Yang You, Hongxun Yao, Tongliang Liu, Min Xu

Combating Label Noise With A General Surrogate Model For Sample Selection

Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g. multimodal retrieval and recommendation. However, existing models suffer from the problem of…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Linli Yao, Weijing Chen, Qin Jin