Related papers: Beyond BLEU: A Semantic Evaluation Method for Code…

200 papers

Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably…

Computation and Language · Computer Science 2025-11-17 Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith

While most neural machine translation (NMT) systems are still trained using maximum likelihood estimation, recent work has demonstrated that optimizing systems to directly improve evaluation metrics such as BLEU can substantially improve…

Computation and Language · Computer Science 2019-09-17 John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along…

Software Engineering · Computer Science 2019-06-13 Ngoc Tran, Hieu Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they…

Software Engineering · Computer Science 2020-09-29 Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

Although neural-based machine translation evaluation metrics, such as COMET or BLEURT, have achieved strong correlations with human judgements, they are sometimes unreliable in detecting certain phenomena that can be considered as critical…

Computation and Language · Computer Science 2023-05-31 Taisiya Glushkova, Chrysoula Zerva, André F. T. Martins

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models…

Software Engineering · Computer Science 2021-06-17 Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran

Our research extends the Bilingual Evaluation Understudy (BLEU) evaluation technique for statistical machine translation to make it more adjustable and robust. We intend to adapt it to resemble human evaluation more. We perform experiments…

Computation and Language · Computer Science 2015-10-01 Krzysztof Wołk, Krzysztof Marasek

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for…

Computation and Language · Computer Science 2025-04-22 Yirong Sun, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, Xiaoyu Shen

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper…

Software Engineering · Computer Science 2026-01-21 Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence…

Computation and Language · Computer Science 2022-10-26 Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, Mohit Iyyer

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce…

Computation and Language · Computer Science 2026-02-20 Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya

Code quality evaluation involves scoring generated code quality based on a reference code for a specific problem statement. Currently, there are two main forms of evaluating code quality: match-based evaluation and execution-based…

Software Engineering · Computer Science 2024-12-03 Fangzhou Xu, Sai Zhang, Zhenchang Xing, Xiaowang Zhang, Yahong Han, Zhiyong Feng

Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated…

Software Engineering · Computer Science 2025-01-10 Yanjie Jiang, Hui Liu, Tianyi Chen, Fu Fan, Chunhao Dong, Kui Liu, Lu Zhang

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, Zhi Jin

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values…

Computation and Language · Computer Science 2018-09-13 Matt Post

In this study, we leverage LLM to enhance the semantic analysis and develop similarity metrics for texts, addressing the limitations of traditional unsupervised NLP metrics like ROUGE and BLEU. We develop a framework where LLMs such as…

Computation and Language · Computer Science 2024-02-22 Shaochen Xu, Zihao Wu, Huaqin Zhao, Peng Shu, Zhengliang Liu, Wenxiong Liao, Sheng Li, Andrea Sikora, Tianming Liu, Xiang Li

LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing…

Software Engineering · Computer Science 2026-05-12 Weilin He, Arindam Sharma, Cristina David

This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While…

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers…

Computation and Language · Computer Science 2020-11-30 Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton