Related papers: OODEval: Evaluating Large Language Models on Objec…

Human-Aligned Code Readability Assessment with Large Language Models

Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models…

Software Engineering · Computer Science 2025-10-21 Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical…

Computation and Language · Computer Science 2026-04-24 Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao, Bifan Wei, Yaqiang Wu, Jun Liu

OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models

Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval…

Computation and Language · Computer Science 2024-02-22 Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, Dacheng Tao

Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming

In the area of code generation research, the emphasis has transitioned from crafting individual functions to developing class-level method code that integrates contextual information. This shift has brought several benchmarks such as…

Software Engineering · Computer Science 2024-08-28 Zinan Wang

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In…

Computation and Language · Computer Science 2023-11-17 Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, Deyi Xiong

JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of…

Machine Learning · Computer Science 2024-10-14 Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu

Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?

Background: Large Language Models (LLMs) are increasingly used for code generation. However, their ability to generate multi-class projects that require object-oriented design (OOD) remains unclear, especially relative to projects developed…

Software Engineering · Computer Science 2026-05-20 Zushuai Zhang, Elliott Wen, Ewan Tempero

SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation

In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming…

Machine Learning · Computer Science 2025-06-02 Ivan Petrukha, Yana Kurliak, Nataliia Stulova

ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult…

Machine Learning · Computer Science 2025-11-03 Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem…

Computation and Language · Computer Science 2023-10-24 Daman Arora, Himanshu Gaurav Singh, Mausam

A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming…

Software Engineering · Computer Science 2025-10-01 Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin

LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4 and Bard's Capacity to Handle Object-Oriented Programming Assignments

Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments. However, object-oriented programming (OOP), with its inherent complexity involving the identification of entities,…

Software Engineering · Computer Science 2024-03-12 Bruno Pereira Cipriano, Pedro Alves

Large Language Models and Mathematical Reasoning Failures

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze…

Artificial Intelligence · Computer Science 2025-02-24 Johan Boye, Birger Moell

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science 2023-11-07 Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Mathematical Computation and Reasoning Errors by Large Language Models

Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math…

Artificial Intelligence · Computer Science 2025-08-15 Liang Zhang, Edith Aurora Graf

What is the best model? Application-driven Evaluation for Large Language Models

General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt…

Computation and Language · Computer Science 2024-06-18 Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu

Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)

Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the…

Computation and Language · Computer Science 2025-05-16 Jadon Geathers, Yann Hicke, Colleen Chan, Niroop Rajashekar, Justin Sewell, Susannah Cornes, Rene F. Kizilcec, Dennis Shung

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation…

Computation and Language · Computer Science 2024-12-17 Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen