Related papers: Representation Learning with Semantic-aware Instan…

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics

Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos, Leonid Sigal

Learning Contrastive Representation for Semantic Correspondence

Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale and pose exist even for objects from the same category, and 2) labeling…

Computer Vision and Pattern Recognition · Computer Science 2022-03-11 Taihong Xiao, Sifei Liu, Shalini De Mello, Zhiding Yu, Jan Kautz, Ming-Hsuan Yang

Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning

The integration of visual and textual data in Vision-Language Pre-training (VLP) models is crucial for enhancing vision-language understanding. However, the adversarial robustness of these models, especially in the alignment of image-text…

Multimedia · Computer Science 2025-06-03 Youze Wang, Wenbo Hu, Yinpeng Dong, Hanwang Zhang, Hang Su, Richang Hong

MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning

Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs. However, the number of image-text pairs in medical datasets is usually orders of magnitude smaller than that in…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Jiarun Liu, Hong-Yu Zhou, Cheng Li, Weijian Huang, Hao Yang, Yong Liang, Shanshan Wang

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

This paper proposes a single-stage training approach that semantically aligns three modalities - audio, visual, and text using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing…

Sound · Computer Science 2025-05-21 Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

Semantic-Aware Contrastive Learning for Multi-object Medical Image Segmentation

Medical image segmentation, or computing voxelwise semantic masks, is a fundamental yet challenging task. To increase the ability of encoder-decoder neural networks to perform this task across large…

Computer Vision and Pattern Recognition · Computer Science 2021-11-10 Ho Hin Lee, Yucheng Tang, Qi Yang, Xin Yu, Shunxing Bao, Leon Y. Cai, Lucas W. Remedios, Bennett A. Landman, Yuankai Huo

MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However, existing research overlooks the…

Computer Vision and Pattern Recognition · Computer Science 2024-02-06 Zhe Li, Laurence T. Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, Stan Z. Li

Generative Negative Text Replay for Continual Vision-Language Pretraining

Vision-language pre-training (VLP) has attracted increasing attention recently. With a large number of image-text pairs, VLP models trained with contrastive loss have achieved impressive performance in various tasks, especially the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-01 Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, Xuming He

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

This paper introduces an innovative approach to Medical Vision-Language Pre-training (Med-VLP) in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Hanqi Jiang, Xixuan Hao, Yuzhou Huang, Chong Ma, Jiaxun Zhang, Yi Pan, Ruimao Zhang

Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge

Weakly Supervised Vision-and-Language Pre-training with Relative Representations

Weakly supervised vision-and-language pre-training (WVLP), which learns cross-modal representations with limited cross-modal supervision, has been shown to effectively reduce the data cost of pre-training while maintaining decent…

Computer Vision and Pattern Recognition · Computer Science 2023-05-26 Chi Chen, Peng Li, Maosong Sun, Yang Liu

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Vision-language pre-training (VLP) methods have blossomed recently, and their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

Positional Contrastive Learning for Volumetric Medical Image Segmentation

The success of deep learning heavily depends on the availability of large labeled training sets. However, it is hard to get large labeled datasets in the medical image domain because of strict privacy concerns and costly labeling efforts.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-30 Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Haiyun Yuan, Meiping Huang, Jian Zhuang, Jingtong Hu, Yiyu Shi

Contrastive Visual-Linguistic Pretraining

Several multi-modality representation learning approaches such as LXMERT and ViLBERT have been proposed recently. Such approaches can achieve superior performance due to the high-level semantic information captured during large-scale…

Computer Vision and Pattern Recognition · Computer Science 2020-07-28 Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, Sen Su

Enhancing medical vision-language contrastive learning via inter-matching relation modelling

Medical image representations can be learned through medical vision-language contrastive learning (mVLCL) where medical imaging reports are used as weak supervision through image-text alignment. These learned image representations can be…

Computer Vision and Pattern Recognition · Computer Science 2025-02-10 Mingjian Li, Mingyuan Meng, Michael Fulham, David Dagan Feng, Lei Bi, Jinman Kim

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention

Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong

Semantic-aware Contrastive Learning for More Accurate Semantic Parsing

Since meaning representations are detailed and accurate annotations that express fine-grained sequence-level semantics, it is usually hard to train discriminative semantic parsers via Maximum Likelihood Estimation (MLE) in an…

Computation and Language · Computer Science 2023-01-20 Shan Wu, Chunlei Xin, Bo Chen, Xianpei Han, Le Sun

Medical Vision Language Pretraining: A survey

Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models…

Computer Vision and Pattern Recognition · Computer Science 2023-12-12 Prashant Shrestha, Sanskar Amgain, Bidur Khanal, Cristian A. Linte, Binod Bhattarai