Related papers: Vision Language Model-based Caption Evaluation Met…

VIVECaption: A Split Approach to Caption Quality Improvement

Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Varun Ananth, Baqiao Liu, Haoran Cai

Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models

Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved…

Computation and Language · Computer Science 2025-02-25 Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu

Intrinsic Image Captioning Evaluation

The image captioning task is to generate suitable descriptions from images. This task poses several challenges, such as accuracy, fluency, and diversity. However, there are few metrics that can cover all these properties while…

Computer Vision and Pattern Recognition · Computer Science 2020-12-15 Chao Zeng, Sam Kwong

From Captions to Visual Concepts and Back

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to…

Computer Vision and Pattern Recognition · Computer Science 2016-02-22 Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

Image Retrieval from Contextual Descriptions

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, Siva Reddy

Benchmarking and Improving Detail Image Caption

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research discusses models' image captioning performance because of the outdated short-caption…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

CapText: Large Language Model-based Caption Generation From Image Context and Description

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary…

Machine Learning · Computer Science 2023-06-07 Shinjini Ghosh, Sagnik Anupam

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

LCEval: Learned Composite Metric for Caption Evaluation

Automatic evaluation metrics hold a fundamental importance in the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system…

Artificial Intelligence · Computer Science 2020-12-25 Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

A Weighted Multi-Criteria Decision Making Approach for Image Captioning

Image captioning aims at automatically generating descriptions of an image in natural language. This is a challenging problem in the field of artificial intelligence that has recently received significant attention in the computer vision…

Computer Vision and Pattern Recognition · Computer Science 2019-04-02 Hassan Maleki Galandouz, Mohsen Ebrahimi Moghaddam, Mehrnoush Shamsfard

Generating Accurate and Detailed Captions for High-Resolution Images

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the…

Computation and Language · Computer Science 2022-11-11 Michele Cafagna, Kees van Deemter, Albert Gatt

Towards Multimodal Vision-Language Models Generating Non-Generic Text

Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Wes Robbins, Zanyar Zohourianshahzadi, Jugal Kalita

Improving Image Captioning with Better Use of Captions

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu

Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do…

Computer Vision and Pattern Recognition · Computer Science 2018-07-17 Jianfeng Dong, Xirong Li, Cees G. M. Snoek