Related papers: Learning to Exploit Temporal Structure for Biomedi…

Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime

This paper explores training medical vision-language models (VLMs) -- where the visual and language inputs are embedded into a common space -- with a particular focus on scenarios where training data is limited, as is often the case in…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-03 · Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing

Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text with its complex semantics poses…

Computer Vision and Pattern Recognition · Computer Science · 2022-12-08 · Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, Ozan Oktay

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science · 2021-03-16 · Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in…

Artificial Intelligence · Computer Science · 2024-05-31 · Jinxia Yang, Bing Su, Wayne Xin Zhao, Ji-Rong Wen

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Recently a number of studies demonstrated impressive performance on diverse vision-language multi-modal tasks such as image captioning and visual question answering by extending the BERT architecture with multi-modal pre-training…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi

VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine

Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-19 · Ziyang Zhang, Yang Yu, Xulei Yang, Si Yong Yeo

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following…

Image and Video Processing · Electrical Eng. & Systems · 2023-04-04 · Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations

Clinical outcome or severity prediction from medical images has largely focused on learning representations from single-timepoint or snapshot scans. It has been shown that disease progression can be better characterized by temporal imaging.…

Image and Video Processing · Electrical Eng. & Systems · 2022-04-01 · Aishik Konwer, Xuan Xu, Joseph Bae, Chao Chen, Prateek Prasanna

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval,…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-04 · Yikuan Li, Hanyin Wang, Yuan Luo

Libra: Leveraging Temporal Images for Biomedical Radiology Analysis

Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-05 · Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Vision-Language Pre-training (VLP) has shown the merits of analysing medical images, by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-08 · Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu

CLIP-IT: CLIP-based Pairing for Histology Images Classification

Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-26 · Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-04 · Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data,…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-02 · Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

Medical Vision Language Pretraining: A survey

Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-12 · Prashant Shrestha, Sanskar Amgain, Bidur Khanal, Cristian A. Linte, Binod Bhattarai

Probing Visual Language Priors in VLMs

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-15 · Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science · 2023-10-20 · Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi

BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation

Medical image segmentation typically relies solely on visual data, overlooking the rich textual information clinicians use for diagnosis. Vision-language models attempt to bridge this gap, but existing approaches often process visual and…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Rafi Ibn Sultan, Hui Zhu, Chengyin Li, Dongxiao Zhu

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-04 · Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?

Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular,…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed