Related papers: ProjectEval: A Benchmark for Programming Agents Au…

200 papers

In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real-world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of…

Software Engineering · Computer Science 2025-11-06 Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, Yuanping Guo

Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating…

Software Engineering · Computer Science 2026-03-24 Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer…

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift, excluding problems that require both human…

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides…

Artificial Intelligence · Computer Science 2026-02-10 Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang

Current code generation evaluation measures functional correctness on well-formed inputs that satisfy all input preconditions. This paradigm has a critical limitation: task descriptions often leave these preconditions implicit, while…

Artificial Intelligence · Computer Science 2026-04-21 Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new…

Computation and Language · Computer Science 2025-12-03 Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation…

Computation and Language · Computer Science 2026-03-11 Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

Code completion, a key downstream task in code generation, is one of the most frequent and impactful methods for enhancing developer productivity in software development. As intelligent completion tools evolve, we need a robust evaluation…

Software Engineering · Computer Science 2024-10-25 Zhenyu Pan, Rongyu Cao, Yongchang Cao, Yingwei Ma, Binhua Li, Fei Huang, Han Liu, Yongbin Li

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human…

Computation and Language · Computer Science 2023-08-15 Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu

Effort estimation is a crucial activity in agile software development, where teams collaboratively review, discuss, and estimate the effort required to complete user stories in a product backlog. Current practices in agile effort estimation…

Software Engineering · Computer Science 2025-09-19 Thanh-Long Bui, Hoa Khanh Dam, Rashina Hoda

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what…

Computation and Language · Computer Science 2024-05-14 Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qinqyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, Julia Kiseleva

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However,…

Computation and Language · Computer Science 2025-05-19 Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the…

Software Engineering · Computer Science 2025-02-27 Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on…

Software Engineering · Computer Science 2025-02-20 Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, Hanghang Tong

We present two comprehensive benchmarks to evaluate the performance of language models in coding assistance tasks, covering code writing, debugging, code review, and conceptual understanding. Our main contribution includes two curated…

Software Engineering · Computer Science 2024-12-10 Nidhish Shah, Zulkuf Genc, Dogu Araci

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

To evaluate the repository-level code generation capabilities of Large Language Models (LLMs) in complex real-world software development scenarios, many evaluation methods have been developed. These methods typically leverage contextual…

Software Engineering · Computer Science 2025-03-19 Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, Zibin Zheng