Related papers: Question-controlled Text-aware Image Captioning

Towards Accurate Text-based Image Captioning with Content Diversity Exploration

Text-based image captioning (TextCap), which aims to read and reason about images with text, is crucial for a machine to understand a detailed and complex scene environment, considering that text is omnipresent in daily life. This task, however,…

Computer Vision and Pattern Recognition · Computer Science 2021-05-10 Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo, Qing Du, Qi Wu

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at…

Computer Vision and Pattern Recognition · Computer Science 2019-05-10 Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

PromptCap: Prompt-Guided Task-Aware Image Captioning

Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

Joint Image Captioning and Question Answering

Answering visual questions requires acquiring everyday common knowledge and modeling the semantic connections among different parts of images, which is too difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile,…

Computation and Language · Computer Science 2018-05-23 Jialin Wu, Zeyuan Hu, Raymond J. Mooney

Understanding Guided Image Captioning Performance across Domains

Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models…

Computer Vision and Pattern Recognition · Computer Science 2021-11-12 Edwin G. Ng, Bo Pang, Piyush Sharma, Radu Soricut

Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose…

Computer Vision and Pattern Recognition · Computer Science 2025-05-30 Reem AlJunaid, Muzammil Behzad

Generating Diverse and Meaningful Captions

Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram…

Computer Vision and Pattern Recognition · Computer Science 2018-12-20 Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, John D. Kelleher

CapOnImage: Context-driven Dense-Captioning on Image

Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, texts can also be used as decorations on the image to highlight the key…

Computer Vision and Pattern Recognition · Computer Science 2022-04-28 Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang

Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain, necessitating the ability for multimodal reasoning. It aims to generate image captions given specific contextual information. This paper…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Shunqi Mao, Chaoyi Zhang, Hang Su, Hwanjun Song, Igor Shalyminov, Weidong Cai

TextCaps: a Dataset for Image Captioning with Reading Comprehension

Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include…

Computer Vision and Pattern Recognition · Computer Science 2020-08-05 Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

CapWAP: Captioning with a Purpose

The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new…

Computation and Language · Computer Science 2020-11-10 Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H. Clark, Regina Barzilay

Controllable Image Captioning

State-of-the-art image captioners can generate accurate sentences to describe images in a sequence-to-sequence manner without considering controllability and interpretability. This, however, is far from making image captioning widely…

Computer Vision and Pattern Recognition · Computer Science 2022-05-26 Luka Maxwell

QACE: Asking Questions to Evaluate an Image Caption

In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation. QACE generates questions on the evaluated caption and checks its content by asking the questions on either the reference caption or the source…

Computation and Language · Computer Science 2021-08-31 Hwanhee Lee, Thomas Scialom, Seunghyun Yoon, Franck Dernoncourt, Kyomin Jung

FlexCap: Describe Anything in Images in Controllable Detail

We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with…

Computer Vision and Pattern Recognition · Computer Science 2025-01-30 Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

Video Captioning with Guidance of Multimodal Latent Topics

The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified…

Computer Vision and Pattern Recognition · Computer Science 2023-02-15 Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann

Context-Aware Group Captioning via Self-Attention and Contrastive Features

While image captioning has progressed rapidly, existing works focus mainly on describing single images. In this paper, we introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context…

Computer Vision and Pattern Recognition · Computer Science 2020-04-09 Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, Alan Yuille

Partially-Supervised Image Captioning

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild - for example, as assistants for people with impaired vision - a…

Computer Vision and Pattern Recognition · Computer Science 2018-11-29 Peter Anderson, Stephen Gould, Mark Johnson

Text Data-Centric Image Captioning with Interactive Prompts

Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-06-25 Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin

Self-Supervised Image Captioning with CLIP

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which…

Computer Vision and Pattern Recognition · Computer Science 2023-11-03 Chuanyang Jin