Related papers: FixEval: Execution-based Evaluation of Program Fix…

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

The rapid development of large language model (LLM) evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and…

Computation and Language · Computer Science 2024-04-10 Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by…

Software Engineering · Computer Science 2025-11-03 Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, Xia Liu, Ping Yang

EditEval: An Instruction-Based Benchmark for Text Improvements

Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in…

Computation and Language · Computer Science 2022-09-28 Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni

Fixturize: Bridging the Fixture Gap in Test Generation

Current Large Language Models (LLMs) have advanced automated unit test generation but face a critical limitation: they often neglect to construct the necessary test fixtures, which are the environmental setups required for a test to run. To…

Software Engineering · Computer Science 2026-03-26 Chengyi Wang, Pengyu Xue, Zhen Yang, Xiapu Luo, Yuxuan Zhang, Xiran Lyu, Yifei Pei, Zonghan Jia, Yichen Sun, Linhao Wu, Kunwu Zheng

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

ReleaseEval: A Benchmark for Evaluating Language Models in Automated Release Note Generation

Automated release note generation addresses the challenge of documenting frequent software updates, where manual efforts are time-consuming and prone to human error. Although recent advances in language models further enhance this process,…

Software Engineering · Computer Science 2025-11-05 Qianru Meng, Zhaochun Ren, Joost Visser

ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming

LLMs have achieved strong performance on text-based programming tasks, yet they remain unreliable for block-based languages such as Scratch. Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across…

Software Engineering · Computer Science 2026-02-03 Yuan Si, Simeng Han, Daming Li, Hanyuan Shi, Jialu Zhang

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science 2023-11-07 Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation

Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical activity during code reuse, there is no…

Software Engineering · Computer Science 2026-01-09 Tanghaoran Zhang, Xinjun Mao, Shangwen Wang, Yuxin Zhao, Yao Lu, Jin Zhang, Zhang Zhang, Kang Yang, Yue Yu

EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization

The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval,…

Computation and Language · Computer Science 2025-08-14 Yaoning Wang, Jiahao Ying, Yixin Cao, Yubo Ma, Yugang Jiang

A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models

Automatically resolving software issues is crucial for software development in practice, impacting the software quality and user experience. The process of resolving real-world issues encompasses tasks such as question-answering (QA), fault…

Software Engineering · Computer Science 2024-11-28 Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science 2025-03-20 Kush Jain, Gabriel Synnaeve, Baptiste Rozière

LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software

Logical vulnerabilities in software stem from flaws in program logic rather than memory safety, which can lead to critical security failures. Although existing automated program repair techniques primarily focus on repairing memory…

Cryptography and Security · Computer Science 2026-04-24 Syed Md Mukit Rashid, Abdullah Al Ishtiaq, Kai Tu, Yilu Dong, Tianwei Wu, Ali Ranjbar, Tianchang Yang, Najrin Sultana, Shagufta Mehnaz, Syed Rafiul Hussain

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while…

Artificial Intelligence · Computer Science 2026-04-21 Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

MdEval: Massively Multilingual Code Debugging

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their…

Computation and Language · Computer Science 2025-02-25 Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, Tao Sun, Jiaheng Liu, Yunlong Duan, Yu Hao, Liqun Yang, Guanglin Niu, Ge Zhang, Zhoujun Li

Towards a Benchmark Set for Program Repair Based on Partial Fixes

Software bugs significantly contribute to software cost and increase the risk of system malfunctioning. In recent years, many automated program-repair approaches have been proposed to automatically fix undesired program behavior. Despite of…

Software Engineering · Computer Science 2021-07-19 Dirk Beyer, Lars Grunske, Thomas Lemberger, Minxing Tang

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes

Software plays a crucial role in our daily lives, and therefore the quality and security of software systems have become increasingly important. However, vulnerabilities in software still pose a significant threat, as they can have serious…

Software Engineering · Computer Science 2023-09-18 Chaozheng Wang, Zongjie Li, Yun Peng, Shuzheng Gao, Sirong Chen, Shuai Wang, Cuiyun Gao, Michael R. Lyu

ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation

Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks lack the ability to automatically evaluate from users' perspective, and also lack the explainability of the results of…

Software Engineering · Computer Science 2025-06-03 Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, Tianrun Gao