Related papers: Learning Evaluation Models from Large Language Mod…

Automated Evaluation of Personalized Text Generation using Large Language Models

Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While the research progress in this area has been rapid, evaluation still presents a challenge. Traditional…

Computation and Language · Computer Science 2023-10-19 Yaqing Wang , Jiepu Jiang , Mingyang Zhang , Cheng Li , Yi Liang , Qiaozhu Mei , Michael Bendersky

Sequence Level Training with Recurrent Neural Networks

Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image. However, at test time…

Machine Learning · Computer Science 2016-05-10 Marc'Aurelio Ranzato , Sumit Chopra , Michael Auli , Wojciech Zaremba

CERET: Cost-Effective Extrinsic Refinement for Text Generation

Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs in their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality…

Computation and Language · Computer Science 2024-11-05 Jason Cai , Hang Su , Monica Sunkara , Igor Shalyminov , Saab Mansour

SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Is it possible to train a general metric for evaluating text generation quality without human annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on…

Computation and Language · Computer Science 2023-07-10 Wenda Xu , Xian Qian , Mingxuan Wang , Lei Li , William Yang Wang

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We…

Computation and Language · Computer Science 2022-10-27 Wenda Xu , Yilin Tuan , Yujie Lu , Michael Saxon , Lei Li , William Yang Wang

BLEURT: Learning Robust Metrics for Text Generation

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned…

Computation and Language · Computer Science 2020-05-22 Thibault Sellam , Dipanjan Das , Ankur P. Parikh

Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to…

Computation and Language · Computer Science 2025-02-19 Frederic Kirstein , Terry Ruas , Bela Gipp

A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction

As a fundamental task in natural language processing, Chinese Grammatical Error Correction (CGEC) has gradually received widespread attention and become a research hotspot. However, one obvious deficiency for the existing CGEC evaluation…

Computation and Language · Computer Science 2022-05-03 Nankai Lin , Xiaotian Lin , Ziyu Yang , Shengyi Jiang

Measuring and Improving Semantic Diversity of Dialogue Generation

Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity often fail to capture the semantic diversity of generated…

Computation and Language · Computer Science 2022-10-25 Seungju Han , Beomsu Kim , Buru Chang

Self-Evaluation Improves Selective Generation in Large Language Models

Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely…

Computation and Language · Computer Science 2023-12-18 Jie Ren , Yao Zhao , Tu Vu , Peter J. Liu , Balaji Lakshminarayanan

CodeScore: Evaluating Code Generation by Learning Code Execution

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong , Jiazheng Ding , Xue Jiang , Ge Li , Zhuo Li , Zhi Jin

ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature.…

Computation and Language · Computer Science 2025-11-18 Sourya Dipta Das , Shubham Kumar , Kuldeep Yadav

SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation

Existing Large Language Models (LLMs) generate text through unidirectional autoregressive decoding methods to respond to various user queries. These methods tend to consider token selection in a simple sequential manner, making it easy to…

Computation and Language · Computer Science 2024-05-28 Ziqin Luo , Haixia Han , Haokun Zhao , Guochao Jiang , Chengyu Du , Tingyun Li , Jiaqing Liang , Deqing Yang , Yanghua Xiao

CEM: A Data-Efficient Method for Large Language Models to Continue Evolving From Mistakes

As world knowledge advances and new task schemas emerge, Continual Learning (CL) becomes essential for keeping Large Language Models (LLMs) current and addressing their shortcomings. This process typically involves continual instruction…

Machine Learning · Computer Science 2024-12-17 Haokun Zhao , Haixia Han , Jie Shi , Chengyu Du , Jiaqing Liang , Yanghua Xiao

Self-Taught Evaluators

Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human…

Computation and Language · Computer Science 2024-08-09 Tianlu Wang , Ilia Kulikov , Olga Golovneva , Ping Yu , Weizhe Yuan , Jane Dwivedi-Yu , Richard Yuanzhe Pang , Maryam Fazel-Zarandi , Jason Weston , Xian Li

RED-CT: A Systems Design Methodology for Using LLM-labeled Data to Train and Deploy Edge Classifiers for Computational Social Science

Large language models (LLMs) have enhanced our ability to rapidly analyze and classify unstructured natural language data. However, concerns regarding cost, network limitations, and security constraints have posed challenges for their…

Machine Learning · Computer Science 2024-11-05 David Farr , Nico Manzonelli , Iain Cruickshank , Jevin West

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection

Widely used learned metrics for machine translation evaluation, such as COMET and BLEURT, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation…

Computation and Language · Computer Science 2023-10-17 Nuno M. Guerreiro , Ricardo Rei , Daan van Stigt , Luisa Coheur , Pierre Colombo , André F. T. Martins

Exploration of Masked and Causal Language Modelling for Text Generation

Large Language Models (LLMs) have revolutionised the field of Natural Language Processing (NLP) and have achieved state-of-the-art performance in practically every task in this field. However, the prevalent approach used in text generation,…

Computation and Language · Computer Science 2024-08-12 Nicolo Micheletti , Samuel Belkadi , Lifeng Han , Goran Nenadic

CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization

Large Language Models (LLMs) have spurred interest in automatic evaluation methods for summarization, offering a faster, more cost-effective alternative to human evaluation. However, existing methods often fall short when applied to complex…

Computation and Language · Computer Science 2024-09-18 Ziwei Gong , Lin Ai , Harshsaiprasad Deshpande , Alexander Johnson , Emmy Phung , Zehui Wu , Ahmad Emami , Julia Hirschberg