Related papers: FeatureBench: Benchmarking Agentic Coding for Comp…

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks…

Software Engineering · Computer Science 2026-01-19 Jie Yang , Honglin Guo , Li Ji , Jiazheng Zhou , Rui Zheng , Zhikai Lei , Shuo Zhang , Zhiheng Xi , Shichun Liu , Yuxin Wang , Bo Wang , Yining Zheng , Tao Gui , Xipeng Qiu

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings…

Software Engineering · Computer Science 2026-05-06 John Yang , Kilian Lieret , Jeffrey Ma , Parth Thakkar , Dmitrii Pedchenko , Sten Sootla , Emily McMilin , Pengcheng Yin , Rui Hou , Gabriel Synnaeve , Diyi Yang , Ofir Press

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu , Yiwen Zhang , Bing Zhao , Jingzhe Ding , Siyao Liu , Tong Liu , Dandan Wang , Yanan Liu , Zhaojian Li

Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating…

Software Engineering · Computer Science 2026-03-24 Lingyue Fu , Bolun Zhang , Hao Guan , Yaoming Zhu , Lin Qiu , Weiwen Liu , Xuezhi Cao , Xunliang Cai , Weinan Zhang , Yong Yu

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce…

Artificial Intelligence · Computer Science 2026-03-17 Shengda Fan , Xuyan Ye , Yupeng Huo , Zhi-Yuan Chen , Yiju Guo , Shenzhi Yang , Wenkai Yang , Shuqi Ye , Jingwen Chen , Haotian Chen , Xin Cong , Yankai Lin

GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios.…

Software Engineering · Computer Science 2025-09-16 Ziyi Ni , Huacan Wang , Shuo Zhang , Shuo Lu , Ziyang He , Wang You , Zhenheng Tang , Yuntao Du , Bill Sun , Hongzhang Liu , Sen Hu , Ronghao Chen , Bo Li , Xin Li , Chen Hu , Binxing Jiao , Daxin Jiang , Pin Lyu

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu , Hao Yu , Hanchen Zhang , Yifan Xu , Xuanyu Lei , Hanyu Lai , Yu Gu , Hangliang Ding , Kaiwen Men , Kejuan Yang , Shudan Zhang , Xiang Deng , Aohan Zeng , Zhengxiao Du , Chenhui Zhang , Sheng Shen , Tianjun Zhang , Yu Su , Huan Sun , Minlie Huang , Yuxiao Dong , Jie Tang

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo , Suizhi Huang , Mei Li , Dong Huang , Xingsheng Chen , Regina Zhang , Zhijiang Guo , Han Yu , Siu-Ming Yiu , Pietro Lio , Kwok-Yan Lam

FormulaCode: Evaluating Agentic Optimization on Large Codebases

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on…

Software Engineering · Computer Science 2026-05-18 Atharva Sehgal , James Hou , Akanksha Sarkar , Ishaan Mantripragada , Swarat Chaudhuri , Jennifer J. Sun , Yisong Yue

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach…

Artificial Intelligence · Computer Science 2026-03-18 Hong Zhang , Barry Smith , Satish Balay , Le Chen , Murat Keceli , Lois Curfman McInnes , Junchao Zhang

OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this…

Computation and Language · Computer Science 2026-01-19 Deming Ding , Shichun Liu , Enhui Yang , Jiahang Lin , Ziying Chen , Shihan Dou , Honglin Guo , Weiyu Cheng , Pengyu Zhao , Chengjun Xiao , Qunhong Zeng , Qi Zhang , Xuanjing Huang , Qidi Xu , Tao Gui

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides…

Artificial Intelligence · Computer Science 2026-02-10 Pengrui Lu , Shiqi Zhang , Yunzhong Hou , Lyumanshan Ye , Chaoyi Huang , Zixi Chen , Ji Zeng , Hantao Jiang , Pengfei Liu , Yiwei Wang , Ming-Hsuan Yang

RExBench: Can coding agents autonomously implement AI research extensions?

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards , Yukyung Lee , Yujun Audrey Mao , Yulu Qin , Sebastian Schuster , Najoung Kim

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam , Spandan Garg , Jinu Jang , Neel Sundaresan , Roshanak Zilouchian Moghaddam

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce…

Artificial Intelligence · Computer Science 2025-11-04 Yunze Wu , Dayuan Fu , Weiye Si , Zhen Huang , Mohan Jiang , Keyu Li , Shijie Xia , Jie Sun , Tianze Xu , Xiangkun Hu , Pengrui Lu , Xiaojie Cai , Lyumanshan Ye , Wenhong Zhu , Yang Xiao , Pengfei Liu

AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed on benchmark settings, especially in…

Artificial Intelligence · Computer Science 2026-02-19 Ruipeng Wang , Yuxin Chen , Yukai Wang , Chang Wu , Junfeng Fang , Xiaodong Cai , Qi Gu , Hui Su , An Zhang , Xiang Wang , Xunliang Cai , Tat-Seng Chua

Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

As the focus in LLM-based coding shifts from static single-step code generation to multi-step agentic interaction with tools and environments, understanding which tasks will challenge agents and why becomes increasingly difficult. This is…

Artificial Intelligence · Computer Science 2026-04-02 Chris Ge , Daria Kryvosheieva , Daniel Fried , Uzay Girit , Kaivalya Hariharan