What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou; Haoxiang Jia; Shenxi Wu; Huiyuan Zheng; Muling Wu; Yunbo Tao; Ming Zhang; Mingxu Chai; Jessica Fan; Zhiheng Xi; Rui Zheng; Yueming Wu; Ming Wen; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang

Software Engineering 2025-10-20 v2 Computation and Language

Abstract

The rapid progress of LLMs in code generation has drawn significant attention from researchers. To improve LLM-based code generation, current efforts focus predominantly on collecting high-quality datasets and applying diverse training techniques. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. Our investigation, which measured the length, cyclomatic complexity, and number of APIs in the generated code, revealed that these LLMs struggle to generate correct code for more complex problems and tend to produce code that is shorter yet more complicated than the canonical solutions. Additionally, we developed a taxonomy of bugs in incorrect code comprising three categories and ten sub-categories, and analyzed the root causes of common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark, RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between real-world scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our study provides insights into the current limitations of LLM-based code generation and opportunities for improving the accuracy and quality of the generated code.

Keywords

code generation; automated program repair; large language model

Cite

@article{arxiv.2407.06153,
  title  = {What's Wrong with Your Code Generated by Large Language Models? An Extensive Study},
  author = {Shihan Dou and Haoxiang Jia and Shenxi Wu and Huiyuan Zheng and Muling Wu and Yunbo Tao and Ming Zhang and Mingxu Chai and Jessica Fan and Zhiheng Xi and Rui Zheng and Yueming Wu and Ming Wen and Tao Gui and Qi Zhang and Xipeng Qiu and Xuanjing Huang},
  journal= {arXiv preprint arXiv:2407.06153},
  year   = {2025}
}

Comments

Accepted by SCIENCE CHINA Information Sciences (SCIS)
