Related papers: OODEval: Evaluating Large Language Models on Objec…

200 papers

Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models…

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical…

Computation and Language · Computer Science 2026-04-24 Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao, Bifan Wei, Yaqiang Wu, Jun Liu

Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval…

Computation and Language · Computer Science 2024-02-22 Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, Dacheng Tao

In the area of code generation research, the emphasis has transitioned from crafting individual functions to developing class-level method code that integrates contextual information. This shift has brought several benchmarks such as…

Software Engineering · Computer Science 2024-08-28 Zinan Wang

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In…

Computation and Language · Computer Science 2023-11-17 Yimin Jing, Renren Jin, Jiahao Hu, Huishi Qiu, Xiaohua Wang, Peng Wang, Deyi Xiong

Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of…

Machine Learning · Computer Science 2024-10-14 Jialun Cao, Zhiyong Chen, Jiarong Wu, Shing-chi Cheung, Chang Xu

Background: Large Language Models (LLMs) are increasingly used for code generation. However, their ability to generate multi-class projects that require object-oriented design (OOD) remains unclear, especially relative to projects developed…

Software Engineering · Computer Science 2026-05-20 Zushuai Zhang, Elliott Wen, Ewan Tempero

In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming…

Machine Learning · Computer Science 2025-06-02 Ivan Petrukha, Yana Kurliak, Nataliia Stulova

Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult…

Machine Learning · Computer Science 2025-11-03 Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem…

Computation and Language · Computer Science 2023-10-24 Daman Arora, Himanshu Gaurav Singh, Mausam

Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming…

Software Engineering · Computer Science 2025-10-01 Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin

Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments. However, object-oriented programming (OOP), with its inherent complexity involving the identification of entities,…

Software Engineering · Computer Science 2024-03-12 Bruno Pereira Cipriano, Pedro Alves

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze…

Artificial Intelligence · Computer Science 2025-02-24 Johan Boye, Birger Moell

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science 2023-11-07 Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math…

Artificial Intelligence · Computer Science 2025-08-15 Liang Zhang, Edith Aurora Graf

General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt…

Computation and Language · Computer Science 2024-06-18 Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu

Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the…

Computation and Language · Computer Science 2025-05-16 Jadon Geathers, Yann Hicke, Colleen Chan, Niroop Rajashekar, Justin Sewell, Susannah Cornes, Rene F. Kizilcec, Dennis Shung

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation…

Computation and Language · Computer Science 2024-12-17 Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen