Related papers: Beyond Code Generation: Assessing Code LLM Maturit…

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen , Qi Guo , Hongrui Jia , Zhengran Zeng , Xin Wang , Yijiang Xu , Jian Wu , Yidong Wang , Qing Gao , Jindong Wang , Wei Ye , Shikun Zhang

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh , Beiyu Lin

Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?

Automatic software verifiers have become increasingly effective at the task of checking software against (formal) specifications. Yet, their adoption in practice has been hampered by the lack of such specifications in real world code. Large…

Software Engineering · Computer Science 2025-10-15 Cedric Richter , Heike Wehrheim

A Survey on Large Language Models for Code Generation

Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This…

Computation and Language · Computer Science 2025-10-28 Juyong Jiang , Fan Wang , Jiasi Shen , Sungju Kim , Sunghun Kim

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul , Hong Zhu , Ian Bayley

Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis

In the past few years, Large Language Models (LLMs) have exploded in usefulness and popularity for code generation tasks. However, LLMs still struggle with accuracy and are unsuitable for high-risk applications without additional oversight…

Software Engineering · Computer Science 2024-10-29 William Murphy , Nikolaus Holzer , Feitong Qiao , Leyi Cui , Raven Rothkopf , Nathan Koenig , Mark Santolucito

Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs…

Software Engineering · Computer Science 2024-04-17 Madeline Endres , Sarah Fakhoury , Saikat Chakraborty , Shuvendu K. Lahiri

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would…

Software Engineering · Computer Science 2026-03-19 I. S. W. B. Prasetya , Fitsum Kifetew , Davide Prandi

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

Code Evolution Graphs: Understanding Large Language Model Driven Design of Algorithms

Large Language Models (LLMs) have demonstrated great promise in generating code, especially when used inside an evolutionary computation framework to iteratively optimize the generated algorithms. However, in some cases they fail to…

Neural and Evolutionary Computing · Computer Science 2025-03-24 Niki van Stein , Anna V. Kononova , Lars Kotthoff , Thomas Bäck

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has…

Software Engineering · Computer Science 2025-05-06 Nazmus Ashrafi , Salah Bouktif , Mohammed Mediani

HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent

Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the…

Software Engineering · Computer Science 2025-01-29 Jie JW Wu , Fatemeh H Fard

Breaking the Myth: Can Small Models Infer Postconditions Too?

Formal specifications are essential for ensuring software correctness, yet manually writing them is tedious and error-prone. Large Language Models (LLMs) have shown promise in generating such specifications from natural language intents,…

Software Engineering · Computer Science 2025-07-15 Gehao Zhang , Zhenting Wang , Juan Zhai

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising…

Computation and Language · Computer Science 2023-10-03 Ansong Ni , Pengcheng Yin , Yilun Zhao , Martin Riddell , Troy Feng , Rui Shen , Stephen Yin , Ye Liu , Semih Yavuz , Caiming Xiong , Shafiq Joty , Yingbo Zhou , Dragomir Radev , Arman Cohan

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating…

Software Engineering · Computer Science 2026-05-06 Gehao Zhang , Juan Zhai

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure…

Software Engineering · Computer Science 2023-11-01 Jiawei Liu , Chunqiu Steven Xia , Yuyao Wang , Lingming Zhang

DeCon: Detecting Incorrect Assertions via Postconditions Generated by a Large Language Model

Recently, given the docstring for the target problem and the target function signature, large language models (LLMs) have been used not only to generate source code, but also to generate test cases, consisting of test inputs and assertions…

Software Engineering · Computer Science 2025-01-07 Hao Yu , Tianyu Chen , Jiaming Huang , Zongyang Li , Dezhi Ran , Xinyu Wang , Ying Li , Assaf Marron , David Harel , Yuan Xie , Tao Xie

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical…

Software Engineering · Computer Science 2026-05-08 Kaifeng He , Xiaojun Zhang , Peiliang Cai , Mingwei Liu , Yanlin Wang , Chong Wang , Kaifeng Huang , Bihuan Chen , Xin Peng , Zibin Zheng

VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation

With the unprecedented advancements in Large Language Models (LLMs), their application domains have expanded to include code generation tasks across various programming languages. While significant progress has been made in enhancing LLMs…

Software Engineering · Computer Science 2024-06-10 Prashanth Vijayaraghavan , Luyao Shi , Stefano Ambrogio , Charles Mackin , Apoorva Nitsure , David Beymer , Ehsan Degan

Revisit Self-Debugging with Self-Generated Tests for Code Generation

Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of…

Software Engineering · Computer Science 2025-01-23 Xiancai Chen , Zhengwei Tao , Kechi Zhang , Changzhi Zhou , Wanli Gu , Yuanpeng He , Mengdi Zhang , Xunliang Cai , Haiyan Zhao , Zhi Jin