Related papers: Learning to Exploit Temporal Structure for Biomedi…

200 papers

This paper explores training medical vision-language models (VLMs) -- where the visual and language inputs are embedded into a common space -- with a particular focus on scenarios where training data is limited, as is often the case in…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-03 · Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting this data at scale is essential for improving clinical care and accelerating clinical research. Biomedical text with its complex semantics poses…

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science · 2021-03-16 · Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in…

Artificial Intelligence · Computer Science · 2024-05-31 · Jinxia Yang, Bing Su, Wayne Xin Zhao, Ji-Rong Wen

Recently a number of studies demonstrated impressive performance on diverse vision-language multi-modal tasks such as image captioning and visual question answering by extending the BERT architecture with multi-modal pre-training…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi

Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in the general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-19 · Ziyang Zhang, Yang Yu, Xulei Yang, Si Yong Yeo

In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following…

Image and Video Processing · Electrical Eng. & Systems · 2023-04-04 · Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

Clinical outcome or severity prediction from medical images has largely focused on learning representations from single-timepoint or snapshot scans. It has been shown that disease progression can be better characterized by temporal imaging.…

Image and Video Processing · Electrical Eng. & Systems · 2022-04-01 · Aishik Konwer, Xuan Xu, Joseph Bae, Chao Chen, Prateek Prasanna

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval,…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-04 · Yikuan Li, Hanyin Wang, Yuan Luo

Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-05 · Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Vision-Language Pre-training (VLP) has shown the merits of analysing medical images, by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-08 · Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu

Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-26 · Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-04 · Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, PengChuan Zhang, Lei Zhang

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data,…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-02 · Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang

Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-12 · Prashant Shrestha, Sanskar Amgain, Bidur Khanal, Cristian A. Linte, Binod Bhattarai

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-15 · Tiange Luo, Ang Cao, Gunhee Lee, Justin Johnson, Honglak Lee

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science · 2023-10-20 · Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi

Medical image segmentation typically relies solely on visual data, overlooking the rich textual information clinicians use for diagnosis. Vision-language models attempt to bridge this gap, but existing approaches often process visual and…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Rafi Ibn Sultan, Hui Zhu, Chengyin Li, Dongxiao Zhu

In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-04 · Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu

Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular,…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed