Related papers: CodeBLEU: a Method for Automatic Evaluation of Cod…

Out of the BLEU: how should we assess quality of the Code Generation models?

In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community adopted automatic evaluation metrics such as BLEU…

Software Engineering · Computer Science 2023-05-11 Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, Timofey Bryksin

Beyond BLEU: A Semantic Evaluation Method for Code Translation

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We…

Programming Languages · Computer Science 2026-05-08 Julius Näumann, Sven Keidel, Amir Molzam Sharifloo, Mira Mezini

Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization

Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. However,…

Software Engineering · Computer Science 2023-12-27 Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, Gabriele Bavota

TeXBLEU: Automatic Metric for Evaluate LaTeX Format

LaTeX is suitable for creating specially formatted documents in science, technology, mathematics, and computer science. Although the use of mathematical expressions in LaTeX format along with language models is increasing, there are no…

Computation and Language · Computer Science 2024-09-16 Kyudan Jung, Nam-Joon Kim, Hyongon Ryu, Sieun Hyeon, Seung-jun Lee, Hyeok-jae Lee

CodeGen-Test: An Automatic Code Generation Model Integrating Program Test Information

Automatic code generation is to generate the program code according to the given natural language description. The current mainstream approach uses neural networks to encode natural language descriptions, and output abstract syntax trees…

Software Engineering · Computer Science 2022-02-16 Maosheng Zhong, Gen Liu, Hongwei Li, Jiangling Kuang, Jinshan Zeng, Mingwen Wang

CodeScore: Evaluating Code Generation by Learning Code Execution

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, Zhi Jin

ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis

Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics can be classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric…

Software Engineering · Computer Science 2024-06-12 Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang

On the Evaluation of Neural Code Summarization

Source code summaries are important for program comprehension and maintenance. However, there are plenty of programs with missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically…

Software Engineering · Computer Science 2022-02-14 Ensheng Shi, Yanlin Wang, Lun Du, Junjie Chen, Shi Han, Hongyu Zhang, Dongmei Zhang, Hongbin Sun

Why We Need New Evaluation Metrics for NLG

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including…

Computation and Language · Computer Science 2017-09-18 Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser

Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity

Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated…

Software Engineering · Computer Science 2025-01-10 Yanjie Jiang, Hui Liu, Tianyi Chen, Fu Fan, Chunhao Dong, Kui Liu, Lu Zhang

Can Code Evaluation Metrics Detect Code Plagiarism?

Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it…

Software Engineering · Computer Science 2026-04-29 Fahad Ebrahim, Mike Joy

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…

Computation and Language · Computer Science 2023-05-31 Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

On the Evaluation of Commit Message Generation Models: An Experimental Study

Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance. However, writing commit messages manually is time-consuming and laborious, especially when the code is updated…

Software Engineering · Computer Science 2021-07-27 Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Wenqiang Zhang

Integrating Code Metrics into Automated Documentation Generation for Computational Notebooks

Effective code documentation is essential for collaboration, comprehension, and long-term software maintainability, yet developers often neglect it due to its repetitive nature. Automated documentation generation has evolved from heuristic…

Software Engineering · Computer Science 2026-02-10 Mojtaba Mostafavi Ghahfarokhi, Hamed Jahantigh, Alireza Asadi, Abbas Heydarnoori

Does BLEU Score Work for Code Migration?

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along…

Software Engineering · Computer Science 2019-06-13 Ngoc Tran, Hieu Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen

Jointly Measuring Diversity and Quality in Text Generation Models

Text generation is an important Natural Language Processing task with various applications. Although several metrics have already been introduced to evaluate the text generation methods, each of them has its own shortcomings. The most…

Machine Learning · Computer Science 2019-05-22 Ehsan Montahaei, Danial Alihosseini, Mahdieh Soleymani Baghshah

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.…

Computation and Language · Computer Science 2020-06-15 Nitika Mathur, Timothy Baldwin, Trevor Cohn

Evaluating Commit Message Generation: To BLEU Or Not To BLEU?

Commit messages play an important role in several software engineering tasks such as program comprehension and understanding program evolution. However, programmers neglect to write good commit messages. Hence, several Commit Message…

Software Engineering · Computer Science 2022-04-21 Samanta Dey, Venkatesh Vinayakarao, Monika Gupta, Sampath Dechu

SemBleu: A Robust Metric for AMR Parsing Evaluation

Evaluating AMR parsing accuracy involves comparing pairs of AMR graphs. The major evaluation metric, SMATCH (Cai and Knight, 2013), searches for one-to-one mappings between the nodes of two AMRs with a greedy hill-climbing algorithm, which…

Computation and Language · Computer Science 2019-05-31 Linfeng Song, Daniel Gildea