Related papers: Vision Language Model-based Caption Evaluation Met…

200 papers

Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Varun Ananth, Baqiao Liu, Haoran Cai

Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved…

Computation and Language · Computer Science 2025-02-25 Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu

The image captioning task is about to generate suitable descriptions from images. For this task there can be several challenges such as accuracy, fluency and diversity. However there are few metrics that can cover all these properties while…

Computer Vision and Pattern Recognition · Computer Science 2020-12-15 Chao Zeng, Sam Kwong

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to…

Computer Vision and Pattern Recognition · Computer Science 2016-02-22 Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, Siva Reddy

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) research discusses model's image captioning performance because of the outdated short-caption…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures.…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary…

Machine Learning · Computer Science 2023-06-07 Shinjini Ghosh, Sagnik Anupam

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

Automatic evaluation metrics hold a fundamental importance in the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system…

Artificial Intelligence · Computer Science 2020-12-25 Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah

Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan

Image captioning aims at automatically generating descriptions of an image in natural language. This is a challenging problem in the field of artificial intelligence that has recently received significant attention in the computer vision…

Computer Vision and Pattern Recognition · Computer Science 2019-04-02 Hassan Maleki Galandouz, Mohsen Ebrahimi Moghaddam, Mehrnoush Shamsfard

Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the…

Computation and Language · Computer Science 2022-11-11 Michele Cafagna, Kees van Deemter, Albert Gatt

Vision-language models can assess visual context in an image and generate descriptive text. While the generated text may be accurate and syntactically correct, it is often overly general. To address this, recent work has used optical…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Wes Robbins, Zanyar Zohourianshahzadi, Jugal Kalita

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do…

Computer Vision and Pattern Recognition · Computer Science 2018-07-17 Jianfeng Dong, Xirong Li, Cees G. M. Snoek