Related papers: Beyond Correctness: Benchmarking Multi-dimensional…

Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation…

Software Engineering · Computer Science 2024-08-28 Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, Dongha Lee

Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models

Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research mainly focuses on the accuracy of code generation,…

Software Engineering · Computer Science 2025-06-24 Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, Zibin Zheng

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods,…

Software Engineering · Computer Science 2025-11-06 Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

Large Language Models for Code Generation: The Practitioners Perspective

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Software Engineering · Computer Science 2025-01-29 Zeeshan Rasheed, Muhammad Waseem, Kai Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Jussi Rasku, Kari Systä, Pekka Abrahamsson

Coding Triangle: How Does Large Language Model Understand Code?

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three…

Computation and Language · Computer Science 2025-07-09 Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

Despite strong performance on code generation tasks, it remains unclear whether large language models (LLMs) genuinely reason about code execution. Existing code reasoning benchmarks primarily evaluate final output correctness under a…

Software Engineering · Computer Science 2026-04-29 Jun Gao, Yun Peng, Qian Qiao, Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Shichao Weng, Zhenchang Xing, Xiaoxue Ren

Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks

Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks…

Software Engineering · Computer Science 2026-04-15 Yuangang Li, Justin Tian Jin Chen, Ethan Yu, David Hong, Iftekhar Ahmed

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model.…

Software Engineering · Computer Science 2026-05-12 Joanna Szych, Anne Schwerk

FairCoder: Evaluating Social Bias of LLMs in Code Generation

Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of LLMs' outputs. However, research on bias in code generation remains limited. Existing…

Computation and Language · Computer Science 2025-04-03 Yongkang Du, Jen-tse Huang, Jieyu Zhao, Lu Lin

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. Software engineers often rely on LLMs to verify if code implementation satisfy…

Software Engineering · Computer Science 2026-03-03 Haolin Jin, Huaming Chen

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Software Engineering · Computer Science 2025-04-30 Wenjing Yin, Tianze Sun, Yijiong Yu, Jiawei Fang, Guangyao Su, Jiancheng Wang, Zekun Wang, Wei Wang, Ran Chen, Ziyun Dai, Shuai Yuan, Menghang Dong, Peng Luo, Dong Cao, Da Lei, Yajun Zhang, Hao Chen, Xiang Ma, Yong Liu, Weifeng Liu, Yuanjian Xu, Ji Pei

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review

Large Language Models (LLMs) are widely used in software engineering to generate, complete, translate, and fix code, improving developer productivity. While most research focuses on the energy consumption and carbon emissions of model…

Software Engineering · Computer Science 2026-04-15 Sabiya Banu Masthan Ali, Oussema Kirmani, Aroosa Hameed, Syed Muhammad Danish, Gautam Srivastava

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic…

Cryptography and Security · Computer Science 2026-02-02 Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, Zibin Zheng

Benchmarking Linguistic Diversity of Large Language Models

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether…

Computation and Language · Computer Science 2025-07-29 Yanzhu Guo, Guokan Shang, Chloé Clavel

Insights from Benchmarking Frontier Language Models on Web App Code Generation

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…

Software Engineering · Computer Science 2024-09-10 Yi Cui

Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper…

Software Engineering · Computer Science 2026-01-21 Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth