Related papers: Learning Visual Representations with Caption Annot…

200 papers

Automatically captioning visualizations is not new, but recent advances in large language models (LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and…

Computation and Language · Computer Science 2024-07-01 Giuseppe Carenini, Jordon Johnson, Ali Salamatian

The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of…

Computer Vision and Pattern Recognition · Computer Science 2021-09-28 Karan Desai, Justin Johnson

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs. Existing works usually tackle this task using adversarial learning and visual concept reward based on reinforcement…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Peipei Zhu, Xiao Wang, Lin Zhu, Zhenglong Sun, Weishi Zheng, Yaowei Wang, Changwen Chen

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Chia-Wen Kuo, Zsolt Kira

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In…

Computation and Language · Computer Science 2021-09-07 Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Rongjie Li, Yu Wu, Xuming He

Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level…

Computation and Language · Computer Science 2021-01-01 Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama

Image captioning is a cross-modality knowledge discovery task that aims to automatically describe an image with an informative and coherent sentence. To generate the captions, the previous encoder-decoder frameworks directly…

Computer Vision and Pattern Recognition · Computer Science 2021-02-24 Ziwei Wang, Yadan Luo, Zi Huang

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this…

Computer Vision and Pattern Recognition · Computer Science 2021-03-08 Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by…

Computation and Language · Computer Science 2021-09-13 Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Chao Feng, Zihao Wei, Andrew Owens

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering)…

Computer Vision and Pattern Recognition · Computer Science 2019-12-05 Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao

From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand robot motion. In particular, since Vision…

Robotics · Computer Science 2026-01-13 Kanata Suzuki, Shota Shimizu, Tetsuya Ogata

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine…

Computer Vision and Pattern Recognition · Computer Science 2021-04-02 Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new…

Computation and Language · Computer Science 2020-03-05 Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou