Related papers: Does BLEU Score Work for Code Migration?

Beyond BLEU: A Semantic Evaluation Method for Code Translation

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We…

Programming Languages · Computer Science 2026-05-08 Julius Näumann, Sven Keidel, Amir Molzam Sharifloo, Mira Mezini

A Call for Clarity in Reporting BLEU Scores

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values…

Computation and Language · Computer Science 2018-09-13 Matt Post

Enhanced Bilingual Evaluation Understudy

Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments…

Computation and Language · Computer Science 2015-10-01 Krzysztof Wołk, Krzysztof Marasek

A Critical Study of Automatic Evaluation in Sign Language Translation

Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably…

Computation and Language · Computer Science 2025-11-17 Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

Beyond BLEU: Training Neural Machine Translation with Semantic Similarity

While most neural machine translation (NMT) systems are still trained using maximum likelihood estimation, recent work has demonstrated that optimizing systems to directly improve evaluation metrics such as BLEU can substantially improve…

Computation and Language · Computer Science 2019-09-17 John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

BLEU might be Guilty but References are not Innocent

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is…

Computation and Language · Computer Science 2020-10-21 Markus Freitag, David Grangier, Isaac Caswell

Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce…

Computation and Language · Computer Science 2026-02-20 Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they…

Software Engineering · Computer Science 2020-09-29 Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have…

Computation and Language · Computer Science 2021-06-30 Benjamin Marie, Atsushi Fujita, Raphael Rubino

DEMETR: Diagnosing Evaluation Metrics for Translation

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence…

Computation and Language · Computer Science 2022-10-26 Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information

The performance of neural machine translation systems is commonly evaluated in terms of BLEU. However, due to its reliance on target language properties and generation, the BLEU metric does not allow an assessment of which translation…

Computation and Language · Computer Science 2020-05-19 Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, Naoaki Okazaki

To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial…

Computation and Language · Computer Science 2021-09-15 Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, Arul Menezes

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models…

Software Engineering · Computer Science 2021-06-17 Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran

Reward Optimization for Neural Machine Translation with Learned Metrics

Neural machine translation (NMT) models are conventionally trained with token-level negative log-likelihood (NLL), which does not guarantee that the generated translations will be optimized for a selected sequence-level evaluation metric.…

Computation and Language · Computer Science 2021-04-16 Raphael Shu, Kang Min Yoo, Jung-Woo Ha

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…

Computation and Language · Computer Science 2023-05-31 Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with…

Computation and Language · Computer Science 2020-06-09 Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger

Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights

The rise of Large Language Models (LLMs) in software engineering, particularly in code generation, has garnered significant attention. However, assessing the quality of AI-generated code remains a challenge due to the inherent complexity of…

Software Engineering · Computer Science 2025-02-13 Ahilan Ayyachamy Nadar Ponnusamy

Style Transfer for Texts: Retrain, Report Errors, Compare with Rewrites

This paper shows that standard assessment methodology for style transfer has several significant problems. First, the standard metrics for style accuracy and semantics preservation vary significantly on different re-runs. Therefore one has…

Computation and Language · Computer Science 2022-11-15 Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, Ivan P. Yamshchikov

Decoding and Diversity in Machine Translation

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers…

Computation and Language · Computer Science 2020-11-30 Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton

Evaluating Commit Message Generation: To BLEU Or Not To BLEU?

Commit messages play an important role in several software engineering tasks such as program comprehension and understanding program evolution. However, programmers neglect to write good commit messages. Hence, several Commit Message…

Software Engineering · Computer Science 2022-04-21 Samanta Dey, Venkatesh Vinayakarao, Monika Gupta, Sampath Dechu