Related papers: Learning Visual Representations with Caption Annot…

Captioning Visualizations with Large Language Models (CVLLM): A Tutorial

Automatically captioning visualizations is not new, but recent advances in large language models (LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and…

Computation and Language · Computer Science 2024-07-01 Giuseppe Carenini, Jordon Johnson, Ali Salamatian

VirTex: Learning Visual Representations from Textual Annotations

The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of…

Computer Vision and Pattern Recognition · Computer Science 2021-09-28 Karan Desai, Justin Johnson

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

Prompt-based Learning for Unpaired Image Captioning

Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs. Existing works usually tackle this task using adversarial learning and visual concept reward based on reinforcement…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Peipei Zhu, Xiao Wang, Lin Zhu, Zhenglong Sun, Weishi Zheng, Yaowei Wang, Changwen Chen

Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Chia-Wen Kuo, Zsolt Kira

Data Efficient Masked Language Modeling for Vision and Language

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In…

Computation and Language · Computer Science 2021-09-07 Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Rongjie Li, Yu Wu, Xuming He

Accurate Word Representations with Universal Visual Guidance

Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) offer a new performant method of contextualized word representations by leveraging the sequence-level…

Computation and Language · Computer Science 2021-01-01 Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama

Enhanced Modality Transition for Image Captioning

Image captioning is a cross-modality knowledge discovery task that aims to automatically describe an image with an informative and coherent sentence. To generate the captions, the previous encoder-decoder frameworks directly…

Computer Vision and Pattern Recognition · Computer Science 2021-02-24 Ziwei Wang, Yadan Luo, Zi Huang

VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this…

Computer Vision and Pattern Recognition · Computer Science 2021-03-08 Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Fusion Models for Improved Visual Captioning

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models

Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Universal Sentence Representation Learning with Conditional Masked Language Model

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by…

Computation and Language · Computer Science 2021-09-13 Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve

Masked Diffusion Captioning for Visual Feature Learning

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Chao Feng, Zihao Wei, Andrew Owens

Unified Vision-Language Pre-Training for Image Captioning and VQA

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering)…

Computer Vision and Pattern Recognition · Computer Science 2019-12-05 Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao

Proprioception Enhances Vision Language Model in Generating Captions and Subtask Segmentations for Robot Task

From the perspective of future developments in robotics, it is crucial to verify whether foundation models trained exclusively on offline data, such as images and language, can understand the robot motion. In particular, since Vision…

Robotics · Computer Science 2026-01-13 Kanata Suzuki, Shota Shimizu, Tetsuya Ogata

Multi-Modal Representation Learning with Text-Driven Soft Masks

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

Masked Vision and Language Modeling for Multi-modal Representation Learning

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose to build joint…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine…

Computer Vision and Pattern Recognition · Computer Science 2021-04-02 Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu

XGPT: Cross-modal Generative Pre-Training for Image Captioning

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new…

Computation and Language · Computer Science 2020-03-05 Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou