Related papers: Enhanced Bilingual Evaluation Understudy

Beyond BLEU: Training Neural Machine Translation with Semantic Similarity

While most neural machine translation (NMT) systems are still trained using maximum likelihood estimation, recent work has demonstrated that optimizing systems to directly improve evaluation metrics such as BLEU can substantially improve…

Computation and Language · Computer Science 2019-09-17 John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

Machine Translation Evaluation using Bi-directional Entailment

In this paper, we propose a new metric for Machine Translation (MT) evaluation, based on bi-directional entailment. We show that machine generated translation can be evaluated by determining paraphrasing with a reference translation…

Computation and Language · Computer Science 2019-11-05 Rakesh Khobragade, Heaven Patel, Anand Namdev, Anish Mishra, Pushpak Bhattacharyya

A Critical Study of Automatic Evaluation in Sign Language Translation

Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably…

Computation and Language · Computer Science 2025-11-17 Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

DEMETR: Diagnosing Evaluation Metrics for Translation

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence…

Computation and Language · Computer Science 2022-10-26 Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…

Computation and Language · Computer Science 2023-05-31 Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

On The Evaluation of Machine Translation Systems Trained With Back-Translation

Back-translation is a widely used data augmentation technique which leverages target monolingual data. However, its effectiveness has been challenged since automatic metrics such as BLEU only show significant improvements for test examples…

Computation and Language · Computer Science 2020-08-19 Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, Michael Auli

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have…

Computation and Language · Computer Science 2021-06-30 Benjamin Marie, Atsushi Fujita, Raphael Rubino

Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task

The quality of machine translation systems has dramatically improved over the last decade, and as a result, evaluation has become an increasingly challenging problem. This paper describes our contribution to the WMT 2020 Metrics Shared…

Computation and Language · Computer Science 2020-10-21 Thibault Sellam, Amy Pu, Hyung Won Chung, Sebastian Gehrmann, Qijun Tan, Markus Freitag, Dipanjan Das, Ankur P. Parikh

BLEU might be Guilty but References are not Innocent

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is…

Computation and Language · Computer Science 2020-10-21 Markus Freitag, David Grangier, Isaac Caswell

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.…

Computation and Language · Computer Science 2020-06-15 Nitika Mathur, Timothy Baldwin, Trevor Cohn

Decoding and Diversity in Machine Translation

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers…

Computation and Language · Computer Science 2020-11-30 Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton

Does BLEU Score Work for Code Migration?

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along…

Software Engineering · Computer Science 2019-06-13 Ngoc Tran, Hieu Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen

It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information

The performance of neural machine translation systems is commonly evaluated in terms of BLEU. However, due to its reliance on target language properties and generation, the BLEU metric does not allow an assessment of which translation…

Computation and Language · Computer Science 2020-05-19 Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, Naoaki Okazaki

Neural and Statistical Methods for Leveraging Meta-information in Machine Translation

In this paper, we discuss different methods which use meta information and richer context that may accompany source language input to improve machine translation quality. We focus on category information of input text as meta information,…

Computation and Language · Computer Science 2017-08-11 Shahram Khadivi, Patrick Wilken, Leonard Dahlmann, Evgeny Matusov

It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data

Most existing simultaneous machine translation (SiMT) systems are trained and evaluated on offline translation corpora. We argue that SiMT systems should be trained and tested on real interpretation data. To illustrate this argument, we…

Computation and Language · Computer Science 2021-10-12 Jinming Zhao, Philip Arthur, Gholamreza Haffari, Trevor Cohn, Ehsan Shareghi

Beyond BLEU: A Semantic Evaluation Method for Code Translation

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We…

Programming Languages · Computer Science 2026-05-08 Julius Näumann, Sven Keidel, Amir Molzam Sharifloo, Mira Mezini

Assessing Reference-Free Peer Evaluation for Machine Translation

Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has been recently shown that the probabilities given by a large,…

Computation and Language · Computer Science 2021-04-13 Sweta Agrawal, George Foster, Markus Freitag, Colin Cherry

An Effective Approach to Unsupervised Machine Translation

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual…

Computation and Language · Computer Science 2021-12-28 Mikel Artetxe, Gorka Labaka, Eneko Agirre

HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric Looking into Multi-Word Expressions

With the fast development of Machine Translation (MT) systems, especially the new boost from Neural MT (NMT) models, the MT output quality has reached a new level of accuracy. However, many researchers criticised that the current popular…

Computation and Language · Computer Science 2022-11-11 Lifeng Han

Towards the evaluation of automatic simultaneous speech translation from a communicative perspective

In recent years, automatic speech-to-speech and speech-to-text translation has gained momentum thanks to advances in artificial intelligence, especially in the domains of speech recognition and machine translation. The quality of such…

Computation and Language · Computer Science 2021-07-02 Claudio Fantinuoli, Bianca Prandi