Related papers: BARTScore: Evaluating Generated Text as Text Gener…

DATScore: Evaluating Translation with Data Augmented Translations

The rapid development of large pretrained language models has revolutionized not only the field of Natural Language Generation (NLG) but also its evaluation. Inspired by the recent work of BARTScore: a metric leveraging the BART language…

Computation and Language · Computer Science 2022-10-14 Moussa Kamal Eddine, Guokan Shang, Michalis Vazirgiannis

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

The state-of-the-art language model-based automatic metrics, e.g. BARTScore, benefiting from large-scale contextualized pre-training, have been successfully used in a wide range of natural language generation (NLG) tasks, including machine…

Computation and Language · Computer Science 2022-12-21 Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong, Dacheng Tao

BERTScore: Evaluating Text Generation with BERT

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However,…

Computation and Language · Computer Science 2020-02-25 Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi

INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback

Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics can not explain their verdict or associate the scores with defects in…

Computation and Language · Computer Science 2023-10-30 Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li

FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation

Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more…

Computation and Language · Computer Science 2021-10-19 Moussa Kamal Eddine, Guokan Shang, Antoine J.-P. Tixier, Michalis Vazirgiannis

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this…

Software Engineering · Computer Science 2023-11-01 Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate…

Computation and Language · Computer Science 2019-09-27 Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger

GPTScore: Evaluate as You Desire

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless,…

Computation and Language · Computer Science 2023-02-14 Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu

Automatic Text Evaluation through the Lens of Wasserstein Barycenters

A new metric, BaryScore, to evaluate text generation based on deep contextualized embeddings (e.g., BERT, RoBERTa, ELMo) is introduced. This metric is motivated by a new framework relying on optimal transport tools, i.e., Wasserstein…

Computation and Language · Computer Science 2021-09-10 Pierre Colombo, Guillaume Staerman, Chloe Clavel, Pablo Piantanida

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper…

Sound · Computer Science 2024-09-04 Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models

Text summarizing is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Large Language Models (LLMs) have shown remarkable promise in generating fluent abstractive…

Computation and Language · Computer Science 2025-03-03 Colleen Gilhuly, Haleh Shahzad

Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

Automatic evaluation for open-ended natural language generation tasks remains a challenge. Existing metrics such as BLEU show a low correlation with human judgment. We propose a novel and powerful learning-based evaluation metric:…

Computation and Language · Computer Science 2020-08-20 Jing Gu, Qingyang Wu, Zhou Yu

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these…

Computation and Language · Computer Science 2024-06-10 Yiqi Liu, Nafise Sadat Moosavi, Chenghua Lin

Towards Explainable Evaluation Metrics for Natural Language Generation

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human…

Computation and Language · Computer Science 2022-03-22 Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques

Text generation is the automated process of producing written or spoken language using computational methods. It involves generating coherent and contextually relevant text based on predefined rules or learned patterns. However, challenges…

Computation and Language · Computer Science 2025-01-30 Rahimanuddin Shaik, Katikela Sreeharsha Kishore

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language -- functions that score system output…

Computation and Language · Computer Science 2021-10-19 Thomas Scialom, Felix Hill

QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance

Existing metrics for assessing question generation not only require costly human reference but also fail to take into account the input context of generation, rendering the lack of deep understanding of the relevance between the generated…

Computation and Language · Computer Science 2022-05-02 Xiaoqiang Wang, Bang Liu, Siliang Tang, Lingfei Wu

Evaluating Factual Consistency of Texts with Semantic Role Labeling

Automated evaluation of text generation systems has recently seen increasing attention, particularly checking whether generated text stays truthful to input sources. Existing methods frequently rely on an evaluation using task-specific…

Computation and Language · Computer Science 2023-05-23 Jing Fan, Dennis Aumiller, Michael Gertz

Parallel Refinements for Lexically Constrained Text Generation with BART

Lexically constrained text generation aims to control the generated text by incorporating some pre-specified keywords into the output. Previous work injects lexical constraints into the output by controlling the decoding process or refining…

Computation and Language · Computer Science 2021-09-28 Xingwei He

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly…

Computation and Language · Computer Science 2021-06-18 Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann