Related papers: Difficulty-Aware Machine Translation Evaluation

LEPOR: An Augmented Machine Translation Evaluation Metric

Machine translation (MT) was developed as one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is that how to evaluate the MT system reasonably and tell us whether the translation…

Computation and Language · Computer Science 2022-01-25 Lifeng Han

Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview

Starting from the 1950s, Machine Translation (MT) was challenged by different scientific solutions, which included rule-based methods, example-based and statistical models (SMT), to hybrid models, and very recent years the neural models…

Computation and Language · Computer Science 2025-08-07 Lifeng Han, Serge Gladkoff

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Machine Translation (MT) evaluation metrics assess translation quality automatically. Recently, researchers have employed MT metrics for various new use cases, such as data filtering and translation re-ranking. However, most MT metrics…

Computation and Language · Computer Science 2024-10-08 Stefano Perrella, Lorenzo Proietti, Pere-Lluís Huguet Cabot, Edoardo Barba, Roberto Navigli

HEVAL: Yet Another Human Evaluation Metric

Machine translation evaluation is a very important activity in machine translation development. Automatic evaluation metrics proposed in literature are inadequate as they require one or more human reference translations to compare them with…

Computation and Language · Computer Science 2013-11-18 Nisheeth Joshi, Iti Mathur, Hemant Darbari, Ajai Kumar

Disentangling Uncertainty in Machine Translation Evaluation

Trainable evaluation metrics for machine translation (MT) exhibit strong correlation with human judgements, but they are often hard to interpret and might produce unreliable scores under noisy or out-of-domain data. Recent work has…

Computation and Language · Computer Science 2022-12-01 Chrysoula Zerva, Taisiya Glushkova, Ricardo Rei, André F. T. Martins

Evaluating Automatic Metrics with Incremental Machine Translation Systems

We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us…

Computation and Language · Computer Science 2024-10-04 Guojun Wu, Shay B. Cohen, Rico Sennrich

Online Learning Meets Machine Translation Evaluation: Finding the Best Systems with the Least Human Effort

In Machine Translation, assessing the quality of a large amount of automatic translations can be challenging. Automatic metrics are not reliable when it comes to high performing systems. In addition, resorting to human evaluators can be…

Computation and Language · Computer Science 2021-05-31 Vânia Mendonça, Ricardo Rei, Luisa Coheur, Alberto Sardinha, Ana Lúcia Santos

Uncertainty-Aware Machine Translation Evaluation

Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on…

Computation and Language · Computer Science 2022-03-28 Taisiya Glushkova, Chrysoula Zerva, Ricardo Rei, André F. T. Martins

Estimating Machine Translation Difficulty

Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas…

Computation and Language · Computer Science 2025-08-29 Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi

Can Automatic Metrics Assess High-Quality Translations?

Automatic metrics for evaluating translation quality are typically validated by measuring how well they correlate with human assessments. However, correlation methods tend to capture only the ability of metrics to differentiate between good…

Computation and Language · Computer Science 2024-10-11 Sweta Agrawal, António Farinhas, Ricardo Rei, André F. T. Martins

Machine Translation Evaluation using Bi-directional Entailment

In this paper, we propose a new metric for Machine Translation (MT) evaluation, based on bi-directional entailment. We show that machine generated translation can be evaluated by determining paraphrasing with a reference translation…

Computation and Language · Computer Science 2019-11-05 Rakesh Khobragade, Heaven Patel, Anand Namdev, Anish Mishra, Pushpak Bhattacharyya

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have…

Computation and Language · Computer Science 2021-06-30 Benjamin Marie, Atsushi Fujita, Raphael Rubino

Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality?

Neural machine translation (NMT) is often criticized for failures that happen without awareness. The lack of competency awareness makes NMT untrustworthy. This is in sharp contrast to human translators who give feedback or conduct further…

Computation and Language · Computer Science 2022-11-28 Pei Zhang, Baosong Yang, Haoran Wei, Dayiheng Liu, Kai Fan, Luo Si, Jun Xie

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer…

Computation and Language · Computer Science 2025-06-25 Lorenzo Proietti, Stefano Perrella, Roberto Navigli

Variance-Aware Machine Translation Test Sets

We release 70 small and discriminative test sets for machine translation (MT) evaluation called variance-aware test sets (VAT), covering 35 translation directions from WMT16 to WMT20 competitions. VAT is automatically created by a novel…

Computation and Language · Computer Science 2021-11-09 Runzhe Zhan, Xuebo Liu, Derek F. Wong, Lidia S. Chao

Machine Translation: A Literature Review

Machine translation (MT) plays an important role in benefiting linguists, sociologists, computer scientists, etc. by processing natural language to translate it into some other natural language. And this demand has grown exponentially over…

Computation and Language · Computer Science 2019-01-07 Ankush Garg, Mayank Agarwal

Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task

The quality of machine translation systems has dramatically improved over the last decade, and as a result, evaluation has become an increasingly challenging problem. This paper describes our contribution to the WMT 2020 Metrics Shared…

Computation and Language · Computer Science 2020-10-21 Thibault Sellam, Amy Pu, Hyung Won Chung, Sebastian Gehrmann, Qijun Tan, Markus Freitag, Dipanjan Das, Ankur P. Parikh

An Overview on Machine Translation Evaluation

Since the 1950s, machine translation (MT) has become one of the important tasks of AI and development, and has experienced several different periods and stages of development, including rule-based methods, statistical methods, and recently…

Computation and Language · Computer Science 2022-02-23 Lifeng Han

Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In!

Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics, ranking them according to their correlation with human judgments. Their results…

Computation and Language · Computer Science 2024-08-27 Stefano Perrella, Lorenzo Proietti, Alessandro Scirè, Edoardo Barba, Roberto Navigli

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.…

Computation and Language · Computer Science 2020-06-15 Nitika Mathur, Timothy Baldwin, Trevor Cohn