DevEval: Evaluating Code Generation in Practical Software Projects

Subjects: Software Engineering; Artificial Intelligence; Computation and Language. Date: 2024-03-07 (v4)

Abstract

How to evaluate Large Language Models (LLMs) in code generation is an open question. Many benchmarks have been proposed, but they are inconsistent with practical software projects: they have unrealistic program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects remain unclear. In this paper, we propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects. DevEval is collected through a rigorous pipeline and contains 2,690 samples from 119 practical projects, covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.
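
The abstract reports Pass@1 without defining it. As a minimal sketch, assuming DevEval uses the unbiased pass@k estimator that is common in code generation benchmarks (the abstract does not confirm this), the metric could be computed as follows. The function names and the `(n, c)` result format are hypothetical and are not taken from the DevEval release.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of programs generated for the task
    c: number of those programs that pass all tests
    k: sampling budget being evaluated (k = 1 for Pass@1)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k programs fail, any k-subset contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn programs are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds hypothetical (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate for a single task reduces to c / n, the fraction of generated programs that pass, and the benchmark score is the mean of that fraction over all tasks, often reported as a percentage.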

Cite

@article{arxiv.2401.06401,
  title  = {DevEval: Evaluating Code Generation in Practical Software Projects},
  author = {Jia Li and Ge Li and Yunfei Zhao and Yongmin Li and Zhi Jin and Hao Zhu and Huanyu Liu and Kaibo Liu and Lecheng Wang and Zheng Fang and Lanshen Wang and Jiazheng Ding and Xuanming Zhang and Yihong Dong and Yuqi Zhu and Bin Gu and Mengfei Yang},
  journal= {arXiv preprint arXiv:2401.06401},
  year   = {2024}
}

Comments

We are re-checking this benchmark and repeating the related experiments. New versions of DevEval will be released later.