Related papers: SuperCap: Multi-resolution Superpixel-based Image …

Self-Supervised Image Captioning with CLIP

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which…

Computer Vision and Pattern Recognition · Computer Science 2023-11-03 Chuanyang Jin

Partially-Supervised Image Captioning

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild - for example, as assistants for people with impaired vision - a…

Computer Vision and Pattern Recognition · Computer Science 2018-11-29 Peter Anderson, Stephen Gould, Mark Johnson

Image Embedding Sampling Method for Diverse Captioning

Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices…

Computer Vision and Pattern Recognition · Computer Science 2025-09-05 Sania Waheed, Na Min An

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

Generating Accurate and Detailed Captions for High-Resolution Images

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung

Benchmarking and Improving Detail Image Caption

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) research discusses model's image captioning performance because of the outdated short-caption…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

Open-Vocabulary Object Detection using Pseudo Caption Labels

Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs. To improve the effectiveness of these methods, researchers have…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Han-Cheol Cho, Won Young Jhoo, Wooyoung Kang, Byungseok Roh

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

A Semi-supervised Framework for Image Captioning

State-of-the-art approaches for image captioning require supervised training data consisting of captions with paired image data. These methods are typically unable to use unsupervised data such as textual data with no corresponding images,…

Computer Vision and Pattern Recognition · Computer Science 2017-06-27 Wenhu Chen, Aurelien Lucchi, Thomas Hofmann

CapText: Large Language Model-based Caption Generation From Image Context and Description

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary…

Machine Learning · Computer Science 2023-06-07 Shinjini Ghosh, Sagnik Anupam

Dense Captioning with Joint Inference and Visual Context

Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images,…

Computer Vision and Pattern Recognition · Computer Science 2017-08-09 Linjie Yang, Kevin Tang, Jianchao Yang, Li-Jia Li

Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model

Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose…

Computer Vision and Pattern Recognition · Computer Science 2025-05-30 Reem AlJunaid, Muzammil Behzad

CaMEL: Mean Teacher Learning for Image Captioning

Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image…

Computer Vision and Pattern Recognition · Computer Science 2022-02-23 Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this…

Computer Vision and Pattern Recognition · Computer Science 2021-03-08 Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. In order to evaluate captions more closely to human preferences, metrics need to discriminate between…

Computer Vision and Pattern Recognition · Computer Science 2024-02-29 Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

VLRM: Vision-Language Models act as Reward Models for Image Captioning

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to…

Computer Vision and Pattern Recognition · Computer Science 2024-04-03 Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta

From Captions to Visual Concepts and Back

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to…

Computer Vision and Pattern Recognition · Computer Science 2016-02-22 Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data.…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Visually-Aware Context Modeling for News Image Captioning

News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens