Related papers: Benchmarking Educational Program Repair

From Benchmark Data To Applicable Program Repair: An Experience Report

This paper describes our approach to automated program repair. We combine various techniques from the literature to achieve this. Our experiments show that our approach performs better than other techniques on standard benchmarks. However,…

Software Engineering · Computer Science 2025-08-25 Mahinthan Chandramohan, Jovan Jancic, Yuntong Zhang, Padmanabhan Krishnan

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including…

Software Engineering · Computer Science 2025-11-05 Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics…

Artificial Intelligence · Computer Science 2025-10-21 Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

A Benchmark for Language Models in Real-World System Building

During migration across instruction set architectures (ISAs), software package build repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. While Large Language Models…

Software Engineering · Computer Science 2026-01-21 Weilin Jin, Chenyu Zhao, Zeshun Huang, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Shenglin Zhang, Yongqian Sun, Dan Pei, Yifan Wu, Tong Jia, Ying Li, Zhonghai Wu, Minghua Ma

Automated Program Repair: Emerging trends pose and expose problems for benchmarks

Machine learning (ML) now pervades the field of Automated Program Repair (APR). Algorithms deploy neural machine translation and large language models (LLMs) to generate software patches, among other tasks. But, there are important…

Software Engineering · Computer Science 2024-05-10 Joseph Renzullo, Pemma Reiter, Westley Weimer, Stephanie Forrest

Don't Make Your LLM an Evaluation Benchmark Cheater

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three…

Computation and Language · Computer Science 2026-01-08 Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis…

Computation and Language · Computer Science 2024-12-06 Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh

Large Language Models in Computer Science Education: A Systematic Literature Review

Large language models (LLMs) are becoming increasingly better at a wide range of Natural Language Processing tasks (NLP), such as text generation and understanding. Recently, these models have extended their capabilities to coding tasks,…

Machine Learning · Computer Science 2024-10-23 Nishat Raihan, Mohammed Latif Siddiq, Joanna C. S. Santos, Marcos Zampieri

A Survey of LLM-Based Applications in Programming Education: Balancing Automation and Human Oversight

Novice programmers benefit from timely, personalized support that addresses individual learning gaps, yet the availability of instructors and teaching assistants is inherently limited. Large language models (LLMs) present opportunities to…

Computers and Society · Computer Science 2025-10-07 Griffin Pitts, Anurata Prabha Hridi, Arun-Balajiee Lekshmi-Narayanan

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Towards Pedagogical LLMs with Supervised Fine Tuning for Computing Education

This paper investigates supervised fine-tuning of large language models (LLMs) to improve their pedagogical alignment in computing education, addressing concerns that LLMs may hinder learning outcomes. The project utilised a proprietary…

Computation and Language · Computer Science 2024-11-05 Alexandra Vassar, Jake Renzella, Emily Ross, Andrew Taylor

Empirical Evaluation of Large Language Models in Automated Program Repair

The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models…

Software Engineering · Computer Science 2025-06-17 Jiajun Sun, Fengjie Li, Xinzhu Qi, Hongyu Zhang, Jiajun Jiang

Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer…

Software Engineering · Computer Science 2025-08-19 Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

UCRBench: Benchmarking LLMs on Use Case Recovery

Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and face the risk of being misaligned with actual system behavior, similarly limiting the rigorous evaluation of large language models…

Software Engineering · Computer Science 2025-12-16 Shuyuan Xiao, Yiran Zhang, Weisong Sun, Xiaohong Chen, Yang Liu, Zhi Jin

A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models

The era of large language models (LLM) raises questions not only about how to train models, but also about how to evaluate them. Despite numerous existing benchmarks, insufficient attention is often given to creating assessments that test…

Computation and Language · Computer Science 2024-11-04 Elena Kardanova, Alina Ivanova, Ksenia Tarasova, Taras Pashchenko, Aleksei Tikhoniuk, Elen Yusupova, Anatoly Kasprzhak, Yaroslav Kuzminov, Ekaterina Kruchinskaia, Irina Brun

Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?

Large Language Models (LLMs) often produce code with subtle implementation-level bugs despite strong benchmark performance. These errors are hard for LLMs to spot and can have large behavioural effects; yet when asked to summarise code,…

Software Engineering · Computer Science 2025-11-25 Lukas Twist

Enterprise Benchmarks for Large Language Model Evaluation

The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark…

Computation and Language · Computer Science 2024-10-18 Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, Md. Maruf Hossain, Guang-Jie Ren, Kate Soule, Yada Zhu