Related papers: Representation Learning with Semantic-aware Instan…

200 papers

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos, Leonid Sigal

Dense correspondence across semantically related images has been extensively studied, but still faces two challenges: 1) large variations in appearance, scale and pose exist even for objects from the same category, and 2) labeling…

Computer Vision and Pattern Recognition · Computer Science 2022-03-11 Taihong Xiao, Sifei Liu, Shalini De Mello, Zhiding Yu, Jan Kautz, Ming-Hsuan Yang

The integration of visual and textual data in Vision-Language Pre-training (VLP) models is crucial for enhancing vision-language understanding. However, the adversarial robustness of these models, especially in the alignment of image-text…

Multimedia · Computer Science 2025-06-03 Youze Wang, Wenbo Hu, Yinpeng Dong, Hanwang Zhang, Hang Su, Richang Hong

Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs. However, the number of image-text pairs in medical datasets is usually orders of magnitude smaller than that in…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Jiarun Liu, Hong-Yu Zhou, Cheng Li, Weijian Huang, Hao Yang, Yong Liang, Shanshan Wang

This paper proposes a single-stage training approach that semantically aligns three modalities - audio, visual, and text using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing…

Sound · Computer Science 2025-05-21 Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

Medical image segmentation, or computing voxelwise semantic masks, is a fundamental yet challenging task. To increase the ability of encoder-decoder neural networks to perform this task across large…

Computer Vision and Pattern Recognition · Computer Science 2021-11-10 Ho Hin Lee, Yucheng Tang, Qi Yang, Xin Yu, Shunxing Bao, Leon Y. Cai, Lucas W. Remedios, Bennett A. Landman, Yuankai Huo

The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However, existing research overlooks the…

Computer Vision and Pattern Recognition · Computer Science 2024-02-06 Zhe Li, Laurence T. Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, Stan Z. Li

Vision-language pre-training (VLP) has attracted increasing attention recently. With a large amount of image-text pairs, VLP models trained with contrastive loss have achieved impressive performance in various tasks, especially the…

Computer Vision and Pattern Recognition · Computer Science 2022-11-01 Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, Xuming He

This paper introduces an innovative approach to Medical Vision-Language Pre-training (Med-VLP) in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Hanqi Jiang, Xixuan Hao, Yuzhou Huang, Chong Ma, Jiaxun Zhang, Yi Pan, Ruimao Zhang

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge

Weakly supervised vision-and-language pre-training (WVLP), which learns cross-modal representations with limited cross-modal supervision, has been shown to effectively reduce the data cost of pre-training while maintaining decent…

Computer Vision and Pattern Recognition · Computer Science 2023-05-26 Chi Chen, Peng Li, Maosong Sun, Yang Liu

Vision-language pre-training (VLP) methods have been blossoming recently, and their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

The success of deep learning heavily depends on the availability of large labeled training sets. However, it is hard to get large labeled datasets in the medical image domain because of strict privacy concerns and costly labeling efforts.…

Computer Vision and Pattern Recognition · Computer Science 2021-09-30 Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Haiyun Yuan, Meiping Huang, Jian Zhuang, Jingtong Hu, Yiyu Shi

Several multi-modality representation learning approaches such as LXMERT and ViLBERT have been proposed recently. Such approaches can achieve superior performance due to the high-level semantic information captured during large-scale…

Computer Vision and Pattern Recognition · Computer Science 2020-07-28 Lei Shi, Kai Shuang, Shijie Geng, Peng Su, Zhengkai Jiang, Peng Gao, Zuohui Fu, Gerard de Melo, Sen Su

Medical image representations can be learned through medical vision-language contrastive learning (mVLCL) where medical imaging reports are used as weak supervision through image-text alignment. These learned image representations can be…

Computer Vision and Pattern Recognition · Computer Science 2025-02-10 Mingjian Li, Mingyuan Meng, Michael Fulham, David Dagan Feng, Lei Bi, Jinman Kim

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Peng Zhang, Zhihui Lai, Wenting Chen, Xu Wu, Heng Kong

Since the meaning representations are detailed and accurate annotations which express fine-grained sequence-level semantics, it is usually hard to train discriminative semantic parsers via Maximum Likelihood Estimation (MLE) in an…

Computation and Language · Computer Science 2023-01-20 Shan Wu, Chunlei Xin, Bo Chen, Xianpei Han, Le Sun

Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models…

Computer Vision and Pattern Recognition · Computer Science 2023-12-12 Prashant Shrestha, Sanskar Amgain, Bidur Khanal, Cristian A. Linte, Binod Bhattarai