Related papers: CodeBLEU: a Method for Automatic Evaluation of Cod…
In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community adopted automatic evaluation metrics such as BLEU…
Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We…
Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. However,…
LaTeX is suitable for creating specially formatted documents in science, technology, mathematics, and computer science. Although the use of mathematical expressions in LaTeX format along with language models is increasing, there are no…
Automatic code generation is to generate the program code according to the given natural language description. The current mainstream approach uses neural networks to encode natural language descriptions, and output abstract syntax trees…
A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…
Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…
Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics canbe classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-basedPass@k metric…
Source code summaries are important for program comprehension and maintenance. However, there are plenty of programs with missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically…
The majority of NLG evaluation relies on automatic metrics, such as BLEU . In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including…
Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated…
Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it…
Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…
Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance. However, writing commit messages manually is time-consuming and laborious, especially when the code is updated…
Effective code documentation is essential for collaboration, comprehension, and long-term software maintainability, yet developers often neglect it due to its repetitive nature. Automated documentation generation has evolved from heuristic…
Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along…
Text generation is an important Natural Language Processing task with various applications. Although several metrics have already been introduced to evaluate the text generation methods, each of them has its own shortcomings. The most…
Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem.…
Commit messages play an important role in several software engineering tasks such as program comprehension and understanding program evolution. However, programmers neglect to write good commit messages. Hence, several Commit Message…
Evaluating AMR parsing accuracy involves comparing pairs of AMR graphs. The major evaluation metric, SMATCH (Cai and Knight, 2013), searches for one-to-one mappings between the nodes of two AMRs with a greedy hill-climbing algorithm, which…