Related papers: QuestEval: Summarization Asks for Fact-based Evalu…

Rethinking Automatic Evaluation in Sentence Simplification

Automatic evaluation remains an open research question in Natural Language Generation. In the context of Sentence Simplification, this is particularly challenging: the task requires by nature to replace complex words with simpler ones that…

Computation and Language · Computer Science 2021-04-19 Thomas Scialom, Louis Martin, Jacopo Staiano, Éric Villemonte de la Clergerie, Benoît Sagot

Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference. Ideally, this comparison should measure the summary's information quality by calculating how much…

Computation and Language · Computer Science 2020-10-26 Daniel Deutsch, Dan Roth

Answers Unite! Unsupervised Metrics for Reinforced Summarization Models

Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL enables to consider complex, possibly non-differentiable, metrics that globally assess…

Computation and Language · Computer Science 2019-09-05 Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text overlap based metrics…

Computation and Language · Computer Science 2021-07-28 Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth

A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are,…

Computation and Language · Computer Science 2021-07-28 Daniel Deutsch, Rotem Dror, Dan Roth

OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization

Opinion summarization sets itself apart from other types of summarization tasks due to its distinctive focus on aspects and sentiments. Although certain automated evaluation methods like ROUGE have gained popularity, we have found them to…

Computation and Language · Computer Science 2023-11-14 Yuchen Shen, Xiaojun Wan

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent…

Computation and Language · Computer Science 2022-04-22 Daniel Deutsch, Rotem Dror, Dan Roth

How Far are We from Robust Long Abstractive Summarization?

Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of…

Computation and Language · Computer Science 2022-11-01 Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan

SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling

Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity which cannot well capture semantics nor linguistic quality and require a reference summary which is costly to obtain. Recently, there have been a…

Computation and Language · Computer Science 2022-05-06 Forrest Sheng Bao, Hebi Li, Ge Luo, Minghui Qiu, Yinfei Yang, Youbiao He, Cen Chen

Better Summarization Evaluation with Word Embeddings for ROUGE

ROUGE is a widely adopted, automatic evaluation measure for text summarization. While it has been shown to correlate well with human judgements, it is biased towards surface lexical similarities. This makes it unsuitable for the evaluation…

Computation and Language · Computer Science 2015-08-26 Jun-Ping Ng, Viktoria Abrecht

EVA-Score: Evaluating Abstractive Long-form Summarization on Informativeness through Extraction and Validation

Since LLMs emerged, more attention has been paid to abstractive long-form summarization, where longer input sequences indicate more information contained. Nevertheless, the automatic evaluation of such summaries remains underexplored. The…

Computation and Language · Computer Science 2026-01-30 Yuchen Fan, Yazhe Wan, Xin Zhong, Haonan Cheng, Ning Ding, Bowen Zhou

References Matter: Investigating the Impact of Reference Set Variation on Summarization Evaluation

Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is…

Computation and Language · Computer Science 2025-09-17 Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank

ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks

Evaluation of summarization tasks is extremely crucial to determining the quality of machine generated summaries. Over the last decade, ROUGE has become the standard automatic evaluation measure for evaluating summarization tasks. While…

Information Retrieval · Computer Science 2018-03-07 Kavita Ganesan

LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement,…

Computation and Language · Computer Science 2026-04-29 Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Haihua Chen, Junhua Ding

SummEval: Re-evaluating Summarization Evaluation

The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization…

Computation and Language · Computer Science 2021-02-03 Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev

Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM

Due to the exponential growth of information and the need for efficient information consumption the task of summarization has gained paramount importance. Evaluating summarization accurately and objectively presents significant challenges,…

Computation and Language · Computer Science 2024-12-31 Dong Yuan, Eti Rastogi, Fen Zhao, Sagar Goyal, Gautam Naik, Sree Prasanna Rajagopal

AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume

This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods…

Computation and Language · Computer Science 2025-09-01 Tanguy Herserant, Vincent Guigue

Learning by Semantic Similarity Makes Abstractive Summarization Better

By harnessing pre-trained language models, summarization models had rapid progress recently. However, the models are mainly assessed by automatic evaluation metrics such as ROUGE. Although ROUGE is known for having a positive correlation…

Computation and Language · Computer Science 2021-06-03 Wonjin Yoon, Yoon Sun Yeo, Minbyul Jeong, Bong-Jun Yi, Jaewoo Kang

QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering…

Computation and Language · Computer Science 2022-05-02 Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong

Re-evaluating Evaluation in Text Summarization

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for…

Computation and Language · Computer Science 2020-10-15 Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig