Related papers: COPA: Efficient Vision-Language Pre-training Throu…

Position-guided Text Prompt for Vision-Language Pre-training

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization…

Computer Vision and Pattern Recognition · Computer Science 2023-06-08 Alex Jinpeng Wang , Pan Zhou , Mike Zheng Shou , Shuicheng Yan

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Chaoya Jiang , Haiyang Xu , Chenliang Li , Ming Yan , Wei Ye , Shikun Zhang , Bin Bi , Songfang Huang

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Shraman Pramanick , Li Jing , Sayan Nag , Jiachen Zhu , Hardik Shah , Yann LeCun , Rama Chellappa

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Wei Ye , Chaoya Jiang , Haiyang Xu , Chenhao Ye , Chenliang Li , Ming Yan , Shikun Zhang , Songfang Huang , Fei Huang

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang , Zhen Yang , Bin Xu , Juanzi Li , Yankui Sun

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Yanxin Long , Jianhua Han , Runhui Huang , Xu Hang , Yi Zhu , Chunjing Xu , Xiaodan Liang

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning.…

Computer Vision and Pattern Recognition · Computer Science 2022-08-09 Yongfei Liu , Chenfei Wu , Shao-yen Tseng , Vasudev Lal , Xuming He , Nan Duan

STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information…

Computer Vision and Pattern Recognition · Computer Science 2023-11-10 Weihong Zhong , Mao Zheng , Duyu Tang , Xuan Luo , Heng Gong , Xiaocheng Feng , Bing Qin

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Chuhui Xue , Wenqing Zhang , Yu Hao , Shijian Lu , Philip Torr , Song Bai

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision…

Machine Learning · Statistics 2021-06-11 Wonjae Kim , Bokyung Son , Ildoo Kim

Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in…

Computation and Language · Computer Science 2023-06-30 Yasmine Karoui , Rémi Lebret , Negar Foroutan , Karl Aberer

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li , Ming Yan , Haiyang Xu , Fuli Luo , Wei Wang , Bin Bi , Songfang Huang

VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine

Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Ziyang Zhang , Yang Yu , Xulei Yang , Si Yong Yeo

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Zhe Gan , Linjie Li , Chunyuan Li , Lijuan Wang , Zicheng Liu , Jianfeng Gao

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Hanning Chen , Yang Ni , Wenjun Huang , Yezi Liu , SungHeon Jeong , Fei Wen , Nathaniel Bastian , Hugo Latapie , Mohsen Imani

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Chaoya Jiang , Haiyang Xu , Wei Ye , Qinghao Ye , Chenliang Li , Ming Yan , Bin Bi , Shikun Zhang , Fei Huang , Songfang Huang

CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Regarding the growing nature of real-world data, such an offline training paradigm on ever-expanding…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Hongguang Zhu , Yunchao Wei , Xiaodan Liang , Chunjie Zhang , Yao Zhao

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal…

Multimedia · Computer Science 2021-08-24 Ming Yan , Haiyang Xu , Chenliang Li , Bin Bi , Junfeng Tian , Min Gui , Wei Wang

Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Sifan Long , Zhen Zhao , Junkun Yuan , Zichang Tan , Jiangjiang Liu , Luping Zhou , Shengsheng Wang , Jingdong Wang

Probing Visual Language Priors in VLMs

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Tiange Luo , Ang Cao , Gunhee Lee , Justin Johnson , Honglak Lee