Related papers: Beyond BLEU: A Semantic Evaluation Method for Code…

A Critical Study of Automatic Evaluation in Sign Language Translation

Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably…

Computation and Language · Computer Science 2025-11-17 Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

Beyond BLEU: Training Neural Machine Translation with Semantic Similarity

While most neural machine translation (NMT) systems are still trained using maximum likelihood estimation, recent work has demonstrated that optimizing systems to directly improve evaluation metrics such as BLEU can substantially improve…

Computation and Language · Computer Science 2019-09-17 John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

Does BLEU Score Work for Code Migration?

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along…

Software Engineering · Computer Science 2019-06-13 Ngoc Tran, Hieu Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they…

Software Engineering · Computer Science 2020-09-29 Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…

Computation and Language · Computer Science 2023-05-31 Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models…

Software Engineering · Computer Science 2021-06-17 Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran

Enhanced Bilingual Evaluation Understudy

Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments…

Computation and Language · Computer Science 2015-10-01 Krzysztof Wołk, Krzysztof Marasek

Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for…

Computation and Language · Computer Science 2025-04-22 Yirong Sun, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, Xiaoyu Shen

Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper…

Software Engineering · Computer Science 2026-01-21 Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth

DEMETR: Diagnosing Evaluation Metrics for Translation

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence…

Computation and Language · Computer Science 2022-10-26 Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce…

Computation and Language · Computer Science 2026-02-20 Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya

Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension

Code quality evaluation involves scoring generated code quality based on a reference code for a specific problem statement. Currently, there are two main forms of evaluating code quality: match-based evaluation and execution-based…

Software Engineering · Computer Science 2024-12-03 Fangzhou Xu, Sai Zhang, Zhenchang Xing, Xiaowang Zhang, Yahong Han, Zhiyong Feng

Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity

Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated…

Software Engineering · Computer Science 2025-01-10 Yanjie Jiang, Hui Liu, Tianyi Chen, Fu Fan, Chunhao Dong, Kui Liu, Lu Zhang

CodeScore: Evaluating Code Generation by Learning Code Execution

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, Zhi Jin

A Call for Clarity in Reporting BLEU Scores

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values…

Computation and Language · Computer Science 2018-09-13 Matt Post

Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis

In this study, we leverage LLM to enhance the semantic analysis and develop similarity metrics for texts, addressing the limitations of traditional unsupervised NLP metrics like ROUGE and BLEU. We develop a framework where LLMs such as…

Computation and Language · Computer Science 2024-02-22 Shaochen Xu, Zihao Wu, Huaqin Zhao, Peng Shu, Zhengliang Liu, Wenxiong Liao, Sheng Li, Andrea Sikora, Tianming Liu, Xiang Li

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing…

Software Engineering · Computer Science 2026-05-12 Weilin He, Arindam Sharma, Cristina David

How Small Transformations Expose the Weakness of Semantic Similarity Measures

This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While…

Computation and Language · Computer Science 2025-09-15 Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Aziz Bonkoungou, Micheline Bénédicte Moumoula, Jordan Samhi, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande

Decoding and Diversity in Machine Translation

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers…

Computation and Language · Computer Science 2020-11-30 Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton