Related papers: Mark-Evaluate: Assessing Language Generation using…

Citation Analysis with Mark-and-Recapture

Mark-and-Recapture is a methodology from Population Biology to estimate the number of a species without counting every individual. This is done by multiple samplings of the species using traps and discounting the instances that were caught…

Digital Libraries · Computer Science 2015-03-24 Chuan Wen Loe, Henrik Jeldtoft Jensen

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. While novel metrics are proposed every year, a few popular metrics remain as the de facto metrics to evaluate tasks such as image…

Computation and Language · Computer Science 2020-10-27 Ozan Caglayan, Pranava Madhyastha, Lucia Specia

Estimating the observable population size from biased samples: a new approach to population estimation with capture heterogeneity

Capture-recapture methods aim to estimate the size of a closed population on the basis of multiple incomplete enumerations of individuals. In many applications, the individual probability of being recorded is heterogeneous in the…

Methodology · Statistics 2016-06-08 James E. Johndrow, Kristian Lum, Daniel Manrique-Vallier

On the Estimation of Population Size from a Dependent Triple Record System

Population size estimation based on capture-recapture experiment under triple record system is an interesting problem in various fields including epidemiology, population studies, etc. In many real life scenarios, there exists inherent…

Methodology · Statistics 2022-01-04 Kiranmoy Chatterjee, Prajamitra Bhuyan

Measuring and Improving Semantic Diversity of Dialogue Generation

Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity often fail to capture the semantic diversity of generated…

Computation and Language · Computer Science 2022-10-25 Seungju Han, Beomsu Kim, Buru Chang

RankME: Reliable Human Ratings for Natural Language Generation

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can…

Computation and Language · Computer Science 2018-10-03 Jekaterina Novikova, Ondřej Dušek, Verena Rieser

Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey…

Computation and Language · Computer Science 2024-08-20 Patrícia Schmidtová, Saad Mahamood, Simone Balloccu, Ondřej Dušek, Albert Gatt, Dimitra Gkatzia, David M. Howcroft, Ondřej Plátek, Adarsa Sivaprasad

Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

Automatic evaluation for open-ended natural language generation tasks remains a challenge. Existing metrics such as BLEU show a low correlation with human judgment. We propose a novel and powerful learning-based evaluation metric:…

Computation and Language · Computer Science 2020-08-20 Jing Gu, Qingyang Wu, Zhou Yu

RoMe: A Robust Metric for Evaluating Natural Language Generation

Evaluating Natural Language Generation (NLG) systems is a challenging task. Firstly, the metric should ensure that the generated hypothesis reflects the reference's semantics. Secondly, it should consider the grammatical quality of the…

Computation and Language · Computer Science 2022-03-18 Md Rashad Al Hasan Rony, Liubov Kovriguina, Debanjan Chaudhuri, Ricardo Usbeck, Jens Lehmann

Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation

Human ratings are one of the most prevalent methods to evaluate the performance of natural language processing algorithms. Similarly, it is common to measure the quality of sentences generated by a natural language generation model using…

Computation and Language · Computer Science 2021-04-13 Jakob Nyberg, Ramesh Manuvinakurike, Maike Paetzel-Prüsmann

Unifying Human and Statistical Evaluation for Natural Language Generation

How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set.…

Computation and Language · Computer Science 2019-04-08 Tatsunori B. Hashimoto, Hugh Zhang, Percy Liang

On the estimation of population size from a post-stratified two sample capture-recapture data under dependence

Population size estimation based on two sample capture-recapture type experiment is an interesting problem in various fields including epidemiology, public health, population studies, etc. The Lincoln-Petersen estimate is popularly used…

Methodology · Statistics 2019-01-21 Kiranmoy Chatterjee, Prajamitra Bhuyan

Estimation of population size based on capture recapture designs and evaluation of the estimation reliability

We propose a modern method to estimate population size based on capture-recapture designs of K samples. The observed data is formulated as a sample of n i.i.d. K-dimensional vectors of binary indicators, where the k-th component of each…

Statistics Theory · Mathematics 2021-05-13 Yue You, Mark van der Laan, Philip Collender, Qu Cheng, Alan Hubbard, Nicholas P Jewell, Zhiyue Tom Hu, Robin Mejia, Justin Remais

Jointly Measuring Diversity and Quality in Text Generation Models

Text generation is an important Natural Language Processing task with various applications. Although several metrics have already been introduced to evaluate the text generation methods, each of them has its own shortcomings. The most…

Machine Learning · Computer Science 2019-05-22 Ehsan Montahaei, Danial Alihosseini, Mahdieh Soleymani Baghshah

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language -- functions that score system output…

Computation and Language · Computer Science 2021-10-19 Thomas Scialom, Felix Hill

Evaluation of Text Generation: A Survey

The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic…

Computation and Language · Computer Science 2021-05-19 Asli Celikyilmaz, Elizabeth Clark, Jianfeng Gao

Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these…

Information Retrieval · Computer Science 2024-01-11 Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke

Language Model Evaluation in Open-ended Text Generation

Although current state-of-the-art language models have achieved impressive results in numerous natural language processing tasks, they still could not solve the problem of producing repetitive, dull and sometimes inconsistent text in…

Computation and Language · Computer Science 2021-08-10 An Nguyen

Dynamic Human Evaluation for Relative Model Comparisons

Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have reported flaws when applied to measure quality aspects of generated text and have been shown to…

Computation and Language · Computer Science 2022-04-29 Thórhildur Thorleiksdóttir, Cedric Renggli, Nora Hollenstein, Ce Zhang

From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al.,…

Computation and Language · Computer Science 2024-12-30 Albert Kornilov, Tatiana Shavrina