OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

Shuai Wang; Liang Ding; Li Shen; Yong Luo; Bo Du; Dacheng Tao

Computation and Language 2024-02-22 v2

Abstract

Advancing automated programming requires robust and comprehensive code generation benchmarks, yet current evaluation frameworks, e.g., HumanEval and MBPP, largely neglect object-oriented programming (OOP) in favor of functional programming (FP). To address this, our study introduces a pioneering OOP-focused benchmark featuring 431 Python programs that cover essential OOP concepts and features such as classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, which enhances the traditional pass@k measure. Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
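
For readers unfamiliar with the pass@k measure that the abstract refers to, the following is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval-style benchmarks. It is not the paper's pass@o metric, whose definition is not given in this abstract; the function name `pass_at_k` and its parameters are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sampling budget being evaluated (1 <= k <= n)

    Computes 1 - C(n - c, k) / C(n, k), i.e. the probability that at
    least one of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only: 10 samples, 3 correct.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is usually the mean of this per-problem estimate over all problems.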

Keywords

code generation; large language model; evaluation benchmark; evaluation

Cite

@article{arxiv.2401.06628,
  title  = {OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models},
  author = {Shuai Wang and Liang Ding and Li Shen and Yong Luo and Bo Du and Dacheng Tao},
  journal= {arXiv preprint arXiv:2401.06628},
  year   = {2024}
}

Comments

20 pages, 15 figures
