Related papers: ProjectEval: A Benchmark for Programming Agents Au…

Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real-world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of…

Software Engineering · Computer Science 2025-11-06 Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, Yuanping Guo

Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating…

Software Engineering · Computer Science 2026-03-24 Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer…

Software Engineering · Computer Science 2024-10-16 Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift, excluding problems that require both human…

Software Engineering · Computer Science 2026-05-22 Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, Hanan Salam

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides…

Artificial Intelligence · Computer Science 2026-02-10 Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang

ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while…

Artificial Intelligence · Computer Science 2026-04-21 Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new…

Computation and Language · Computer Science 2025-12-03 Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation…

Computation and Language · Computer Science 2026-03-11 Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation…

Software Engineering · Computer Science 2024-10-25 Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human…

Computation and Language · Computer Science 2023-08-15 Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu

An LLM-based multi-agent framework for agile effort estimation

Effort estimation is a crucial activity in agile software development, where teams collaboratively review, discuss, and estimate the effort required to complete user stories in a product backlog. Current practices in agile effort estimation…

Software Engineering · Computer Science 2025-09-19 Thanh-Long Bui, Hoa Khanh Dam, Rashina Hoda

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

Assessing and Verifying Task Utility in LLM-Powered Applications

The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what…

Computation and Language · Computer Science 2024-05-14 Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qinqyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, Julia Kiseleva

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However,…

Computation and Language · Computer Science 2025-05-19 Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the…

Software Engineering · Computer Science 2025-02-27 Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark

The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on…

Software Engineering · Computer Science 2025-02-20 Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, Hanghang Tong

StackEval: Benchmarking LLMs in Coding Assistance

We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated…

Software Engineering · Computer Science 2024-12-10 Nidhish Shah, Zulkuf Genc, Dogu Araci

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation

To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual…

Software Engineering · Computer Science 2025-03-19 Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng