Related papers: OOP: Object-Oriented Programming Evaluation Benchm…

A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming…

Software Engineering · Computer Science 2025-10-01 Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin

JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of…

Machine Learning · Computer Science 2024-10-14 Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu

OODEval: Evaluating Large Language Models on Object-Oriented Design

Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. However, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we…

Software Engineering · Computer Science 2026-03-12 Bingxu Xiao, Yunwei Dong, Yiqi Tang, Manqing Zhang, Yifan Zhou, Chunyan Ma, Yepang Liu

Generative AI for Object-Oriented Programming: Writing the Right Code and Reasoning the Right Logic

We find ourselves in the midst of an explosion in artificial intelligence research, particularly with large language models (LLMs). These models have diverse applications spanning finance, commonsense knowledge graphs, medicine, and visual…

Software Engineering · Computer Science 2025-08-08 Gang Xu, Airong Wang, Yushan Pan

LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4 and Bard's Capacity to Handle Object-Oriented Programming Assignments

Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments. However, object-oriented programming (OOP), with its inherent complexity involving the identification of entities,…

Software Engineering · Computer Science 2024-03-12 Bruno Pereira Cipriano, Pedro Alves

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this draws into question the adequacy…

Computation and Language · Computer Science 2025-08-19 Jianbo Dai, Jianqiao Lu, Yunlong Feng, Guangtao Zeng, Rongju Ruan, Ming Cheng, Dong Huang, Haochen Tan, Zhijiang Guo

Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming

In the area of code generation research, the emphasis has transitioned from crafting individual functions to developing class-level method code that integrates contextual information. This shift has brought several benchmarks such as…

Software Engineering · Computer Science 2024-08-28 Zinan Wang

EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming

Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness…

Software Engineering · Computer Science 2026-02-17 Sen Fang, Weiyuan Ding, Mengshi Zhang, Zihao Chen, Bowen Xu

Deep Learning and Machine Learning, Advancing Big Data Analytics and Management: Object-Oriented Programming

Object-Oriented Programming (OOP) has become a crucial paradigm for managing the growing complexity of modern software systems, particularly in fields like machine learning, deep learning, large language models (LLM), and data analytics.…

Computation and Language · Computer Science 2025-12-24 Tianyang Wang, Ziqian Bi, Keyu Chen, Jiawei Xu, Qian Niu, Junyu Liu, Benji Peng, Ming Li, Sen Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Xinyuan Song, Ming Liu

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling

Growing renewable penetration introduces substantial uncertainty into power system operations, necessitating frequent adaptation of dispatch objectives and constraints and challenging expertise-intensive, near-real-time modeling workflows.…

Systems and Control · Electrical Eng. & Systems 2026-05-25 Chao Shen, Zihan Guo, Xu Wan, Zhenghao Yang, Yifan Zhang, Wengi Huang, Jie Song, Zongyan Zhang, Mingyang Sun

OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is the guarantee for maintaining the orderly and stable operation of existing information systems. According to Gartner's…

Artificial Intelligence · Computer Science 2025-06-18 Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, Jianhui Li, Gaogang Xie, Xidao Wen, Xiaohui Nie, Minghua Ma, Dan Pei

A New Benchmark for the Appropriate Evaluation of RTL Code Optimization

The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but…

Artificial Intelligence · Computer Science 2026-01-06 Yao Lu, Shang Liu, Hangan Zhou, Wenji Fang, Qijun Zhang, Zhiyao Xie

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it…

Computation and Language · Computer Science 2025-11-04 Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura

OctoPack: Instruction Tuning Code Large Language Models

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with…

Computation and Language · Computer Science 2024-02-20 Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They…

Software Engineering · Computer Science 2025-01-03 Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for…

Computation and Language · Computer Science 2024-07-08 Ankit Yadav, Himanshu Beniwal, Mayank Singh

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

Autonomous Multi-Objective Optimization Using Large Language Model

Multi-objective optimization problems (MOPs) are ubiquitous in real-world applications, presenting a complex challenge of balancing multiple conflicting objectives. Traditional evolutionary algorithms (EAs), though effective, often rely on…

Neural and Evolutionary Computing · Computer Science 2024-07-29 Yuxiao Huang, Shenghao Wu, Wenjie Zhang, Jibin Wu, Liang Feng, Kay Chen Tan

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…

Artificial Intelligence · Computer Science 2025-05-13 Kai Xu, YiWei Mao, XinYi Guan, ZiLong Feng