Related papers: CodeScore: Evaluating Code Generation by Learning …

ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

Code-focused Large Language Models (LLMs), such as CodeX and Star-Coder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of…

Software Engineering · Computer Science 2025-12-09 Cheng Cheng , Jinqiu Yang

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities…

Computation and Language · Computer Science 2024-06-10 Weixiang Yan , Haitian Liu , Yunkun Wang , Yunzhe Li , Qian Chen , Wen Wang , Tingyu Lin , Weishan Zhao , Li Zhu , Hari Sundaram , Shuiguang Deng

Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach

Ensemble learning has been widely used in machine learning to improve model robustness, accuracy, and generalization, but has not yet been applied to code generation tasks with large language models (LLMs). We propose an ensemble approach…

Software Engineering · Computer Science 2025-07-22 Tarek Mahmud , Bin Duan , Corina Pasareanu , Guowei Yang

Analyzing and Mitigating Surface Bias in Code Evaluation Metrics

With the increasing popularity of large language models (LLMs) and LLM-based agents, reliable and effective code evaluation metrics (CEMs) have become crucial for progress across several software engineering tasks. While popular benchmarks…

Software Engineering · Computer Science 2025-10-09 Simantika Bhattacharjee Dristi , Matthew B. Dwyer

Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension

Code quality evaluation involves scoring generated code quality based on a reference code for a specific problem statement. Currently, there are two main forms of evaluating code quality: match-based evaluation and execution-based…

Software Engineering · Computer Science 2024-12-03 Fangzhou Xu , Sai Zhang , Zhenchang Xing , Xiaowang Zhang , Yahong Han , Zhiyong Feng

Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation…

Software Engineering · Computer Science 2024-08-28 Heejae Chon , Seonghyeon Lee , Jinyoung Yeo , Dongha Lee

A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks

Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges are primarily due to the…

Software Engineering · Computer Science 2024-11-26 Rohit Dandamudi , Gema Rodríguez-Pérez

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model.…

Software Engineering · Computer Science 2026-05-12 Joanna Szych , Anne Schwerk

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen , Qi Guo , Hongrui Jia , Zhengran Zeng , Xin Wang , Yijiang Xu , Jian Wu , Yidong Wang , Qing Gao , Jindong Wang , Wei Ye , Shikun Zhang

DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by…

Computation and Language · Computer Science 2024-10-17 Haau-Sing Li , Patrick Fernandes , Iryna Gurevych , André F. T. Martins

On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation

The task of code generation from natural language (NL2Code) has become extremely popular, especially with the advent of Large Language Models (LLMs). However, efforts to quantify and track this progress have suffered due to a lack of…

Software Engineering · Computer Science 2024-05-06 Atharva Naik

Error Understanding in Program Code With LLM-DL for Multi-label Classification

Programming is a core skill in computer science and software engineering (SE), yet identifying and resolving code errors remains challenging for both novice and experienced developers. While Large Language Models (LLMs) have shown…

Software Engineering · Computer Science 2026-03-27 Md Faizul Ibne Amin , Yutaka Watanobe , Md. Mostafizer Rahman , Daniel M. Muepu , Md. Shahajada Mia

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they…

Software Engineering · Computer Science 2020-09-29 Shuo Ren , Daya Guo , Shuai Lu , Long Zhou , Shujie Liu , Duyu Tang , Neel Sundaresan , Ming Zhou , Ambrosio Blanco , Shuai Ma

Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights

The rise of Large Language Models (LLMs) in software engineering, particularly in code generation, has garnered significant attention. However, assessing the quality of AI-generated code remains a challenge due to the inherent complexity of…

Software Engineering · Computer Science 2025-02-13 Ahilan Ayyachamy Nadar Ponnusamy

Rethinking the Evaluation of Secure Code Generation

Large language models (LLMs) are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current…

Cryptography and Security · Computer Science 2025-11-14 Shih-Chieh Dai , Jun Xu , Guanhong Tao

COFFE: A Code Efficiency Benchmark for Code Generation

Code generation has largely improved development efficiency in the era of large language models (LLMs). With the ability to follow instructions, current LLMs can be prompted to generate code solutions given detailed descriptions in natural…

Software Engineering · Computer Science 2025-02-06 Yun Peng , Jun Wan , Yichen Li , Xiaoxue Ren

IFEvalCode: Controlled Code Generation

Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed…

Computation and Language · Computer Science 2025-08-04 Jian Yang , Wei Zhang , Shukai Liu , Linzheng Chai , Yingshui Tan , Jiaheng Liu , Ge Zhang , Wangchunshu Zhou , Guanglin Niu , Zhoujun Li , Binyuan Hui , Junyang Lin

CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement

While Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, they often produce solutions that lack guarantees of correctness, robustness, and efficiency. This limitation is particularly acute in domains…

Software Engineering · Computer Science 2025-09-04 Yueke Zhang , Yifan Zhang , Kevin Leach , Yu Huang

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul , Hong Zhu , Ian Bayley