Related papers: Learning to Evaluate Image Captioning

Towards Unique and Informative Captioning of Images

Despite considerable progress, state of the art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption…

Computer Vision and Pattern Recognition · Computer Science 2020-09-10 Zeyu Wang, Berthy Feng, Karthik Narasimhan, Olga Russakovsky

LCEval: Learned Composite Metric for Caption Evaluation

Automatic evaluation metrics hold a fundamental importance in the development and fine-grained analysis of captioning systems. While current evaluation metrics tend to achieve an acceptable correlation with human judgements at the system…

Artificial Intelligence · Computer Science 2020-12-25 Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah

VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall

Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption given a specific image or video. The research community has achieved impressive results by continuously proposing new…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Guillermo Ruiz, Tania Ramírez, Daniela Moctezuma

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into…

Computer Vision and Pattern Recognition · Computer Science 2024-07-31 Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Describing like humans: on diversity in image captioning

Recently, the state-of-the-art models for image captioning have overtaken human performance based on the most popular metrics, such as BLEU, METEOR, ROUGE, and CIDEr. Does this mean we have solved the task of image captioning? The above…

Computer Vision and Pattern Recognition · Computer Science 2019-05-16 Qingzhong Wang, Antoni B. Chan

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor…

Computer Vision and Pattern Recognition · Computer Science 2016-08-01 Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould

Intrinsic Image Captioning Evaluation

The image captioning task is about to generate suitable descriptions from images. For this task there can be several challenges such as accuracy, fluency and diversity. However there are few metrics that can cover all these properties while…

Computer Vision and Pattern Recognition · Computer Science 2020-12-15 Chao Zeng, Sam Kwong

Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets

A wide range of image captioning models has been developed, achieving significant improvement based on popular metrics, such as BLEU, CIDEr, and SPICE. However, although the generated captions can accurately describe the image, they are…

Computer Vision and Pattern Recognition · Computer Science 2020-09-30 Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis

The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image…

Computation and Language · Computer Science 2025-09-16 Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann

On Distinctive Image Captioning via Comparing and Reweighting

Recent image captioning models are achieving impressive results based on popular metrics, i.e., BLEU, CIDEr, and SPICE. However, focusing on the most popular metrics that only consider the overlap between the generated captions and human…

Computer Vision and Pattern Recognition · Computer Science 2022-04-11 Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Improved Image Captioning via Policy Gradient optimization of SPIDEr

Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation…

Computer Vision and Pattern Recognition · Computer Science 2018-03-14 Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, Kevin Murphy

Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder

Automatically evaluating the quality of image captions can be very challenging since human language is quite flexible that there can be various expressions for the same meaning. Most of the current captioning metrics rely on token level…

Computer Vision and Pattern Recognition · Computer Science 2021-06-30 Chao Zeng, Tiesong Zhao, Sam Kwong

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models

Image captioning studies heavily rely on automatic evaluation metrics such as BLEU and METEOR. However, such n-gram-based metrics have been shown to correlate poorly with human evaluation, leading to the proposal of alternative metrics such…

Computer Vision and Pattern Recognition · Computer Science 2023-11-08 Yuiga Wada, Kanta Kaneda, Komei Sugiura

CIDEr: Consensus-based Image Description Evaluation

Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is…

Computer Vision and Pattern Recognition · Computer Science 2015-06-04 Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh

A Novel Evaluation Framework for Image2Text Generation

Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Alessio M. Pacces, Evangelos Kanoulas

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a…

Computer Vision and Pattern Recognition · Computer Science 2023-07-21 Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TIGEr: Text-to-Image Grounding for Image Caption Evaluation

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions,…

Computation and Language · Computer Science 2019-09-06 Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao

CLAIR: Evaluating Image Captions with Large Language Models

The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

ImageCaptioner²: Image Captioner for Image Captioning Bias Amplification Assessment

Most pre-trained learning systems are known to suffer from bias, which typically emerges from the data, the model, or both. Measuring and quantifying bias and its sources is a challenging task and has been extensively studied in image…

Computer Vision and Pattern Recognition · Computer Science 2023-06-07 Eslam Mohamed Bakr, Pengzhan Sun, Li Erran Li, Mohamed Elhoseiny