Related papers: COPA: Efficient Vision-Language Pre-training Throu…

200 papers

Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization…

Computer Vision and Pattern Recognition · Computer Science 2023-06-08 Alex Jinpeng Wang , Pan Zhou , Mike Zheng Shou , Shuicheng Yan

Vision Transformers (ViTs) have been widely used in large-scale Vision and Language Pre-training (VLP) models. Though previous VLP works have proved the effectiveness of ViTs, they still suffer from computational efficiency brought by the…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Chaoya Jiang , Haiyang Xu , Chenliang Li , Miang Yan , Wei Ye , Shikun Zhang , Bin Bi , Songfang Huang

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Shraman Pramanick , Li Jing , Sayan Nag , Jiachen Zhu , Hardik Shah , Yann LeCun , Rama Chellappa

Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Wei Ye , Chaoya Jiang , Haiyang Xu , Chenhao Ye , Chenliang Li , Ming Yan , Shikun Zhang , Songhang Huang , Fei Huang

Vision-language pre-training (VLP) methods are blossoming recently, and its crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of…

Computer Vision and Pattern Recognition · Computer Science 2023-09-01 Weihan Wang , Zhen Yang , Bin Xu , Juanzi Li , Yankui Sun

Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Yanxin Long , Jianhua Han , Runhui Huang , Xu Hang , Yi Zhu , Chunjing Xu , Xiaodan Liang

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning.…

Computer Vision and Pattern Recognition · Computer Science 2022-08-09 Yongfei Liu , Chenfei Wu , Shao-yen Tseng , Vasudev Lal , Xuming He , Nan Duan

Although large-scale video-language pre-training models, which usually build a global alignment between the video and the text, have achieved remarkable progress on various downstream tasks, the idea of adopting fine-grained information…

Computer Vision and Pattern Recognition · Computer Science 2023-11-10 Weihong Zhong , Mao Zheng , Duyu Tang , Xuan Luo , Heng Gong , Xiaocheng Feng , Bing Qin

Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to…

Computer Vision and Pattern Recognition · Computer Science 2022-11-15 Chuhui Xue , Wenqing Zhang , Yu Hao , Shijian Lu , Philip Torr , Song Bai

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision…

Machine Learning · Statistics 2021-06-11 Wonjae Kim , Bokyung Son , Ildoo Kim

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in…

Computation and Language · Computer Science 2023-06-30 Yasmine Karoui , Rémi Lebret , Negar Foroutan , Karl Aberer

Vision-language pre-training (VLP) on large-scale image-text pairs has recently witnessed rapid progress for learning cross-modal representations. Existing pre-training methods either directly concatenate image representation and text…

Computation and Language · Computer Science 2021-03-16 Chenliang Li , Ming Yan , Haiyang Xu , Fuli Luo , Wei Wang , Bin Bi , Songfang Huang

Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Ziyang Zhang , Yang Yu , Xulei Yang , Si Yong Yeo

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: (i) VLP for image-text tasks, such as image…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Zhe Gan , Linjie Li , Chunyuan Li , Lijuan Wang , Zicheng Liu , Jianfeng Gao

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Hanning Chen , Yang Ni , Wenjun Huang , Yezi Liu , SungHeon Jeong , Fei Wen , Nathaniel Bastian , Hugo Latapie , Mohsen Imani

Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Chaoya Jiang , Haiyang Xu , Wei Ye , Qinghao Ye , Chenliang Li , Ming Yan , Bin Bi , Shikun Zhang , Fei Huang , Songfang Huang

Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Regarding the growing nature of real-world data, such an offline training paradigm on ever-expanding…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Hongguang Zhu , Yunchao Wei , Xiaodan Liang , Chunjie Zhang , Yao Zhao

Existing approaches to vision-language pre-training (VLP) heavily rely on an object detector based on bounding boxes (regions), where salient objects are first detected from images and then a Transformer-based model is used for cross-modal…

Multimedia · Computer Science 2021-08-24 Ming Yan , Haiyang Xu , Chenliang Li , Bin Bi , Junfeng Tian , Min Gui , Wei Wang

Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Sifan Long , Zhen Zhao , Junkun Yuan , Zichang Tan , Jiangjiang Liu , Luping Zhou , Shengsheng Wang , Jingdong Wang

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Tiange Luo , Ang Cao , Gunhee Lee , Justin Johnson , Honglak Lee