Related papers: ProgramBench: Can Language Models Rebuild Programs…

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam , Spandan Garg , Jinu Jang , Neel Sundaresan , Roshanak Zilouchian Moghaddam

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

Modern Large Language Model (LLM) agents promise end to end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill…

Software Engineering · Computer Science 2025-07-15 Avi Arora , Jinu Jang , Roshanak Zilouchian Moghaddam

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations:…

Software Engineering · Computer Science 2026-04-09 Ruida Hu , Xinchen Wang , Chao Peng , Cuiyun Gao , David Lo

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li , Hongyi Deng , Yiran Zhang , Kechi Zhang , Tianqi Shao , Tiankuo Zhao , Weinan Wang , Zhi Jin , Ge Li , Yang Liu , Yingtao Fang , Yihong Dong

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu , Yiwen Zhang , Bing Zhao , Jingzhe Ding , Siyao Liu , Tong Liu , Dandan Wang , Yanan Liu , Zhaojian Li

AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation

Software development is a complex, multi-phase process traditionally requiring collaboration among individuals with diverse expertise. We propose AgentMesh, a Python-based framework that uses multiple cooperating LLM-powered agents to…

Software Engineering · Computer Science 2025-07-29 Sourena Khanzadeh

RExBench: Can coding agents autonomously implement AI research extensions?

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards , Yukyung Lee , Yujun Audrey Mao , Yulu Qin , Sebastian Schuster , Najoung Kim

\$OneMillion-Bench: How Far are Language Agents from Human Experts?

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world…

Machine Learning · Computer Science 2026-03-10 Qianyu Yang , Yang Liu , Jiaqi Li , Jun Bai , Hao Chen , Kaiyuan Chen , Tiliang Duan , Jiayun Dong , Xiaobo Hu , Zixia Jia , Yang Liu , Tao Peng , Yixin Ren , Ran Tian , Zaiyuan Wang , Yanglihong Xiao , Gang Yao , Lingyue Yin , Ge Zhang , Chun Zhang , Jianpeng Jiao , Zilong Zheng , Yuan Gong

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation

Recent advances in code agents have enabled automated software development at the project level, supported by large language models (LLMs). However, existing benchmarks for code agent evaluation face two major limitations. First, creating…

Software Engineering · Computer Science 2026-03-24 Lingyue Fu , Bolun Zhang , Hao Guan , Yaoming Zhu , Lin Qiu , Weiwen Liu , Xuezhi Cao , Xunliang Cai , Weinan Zhang , Yong Yu

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language…

Artificial Intelligence · Computer Science 2025-04-09 Nayantara Mudur , Hao Cui , Subhashini Venugopalan , Paul Raccuglia , Michael P. Brenner , Peter Norgaard

Coding Agents Don't Know When to Act

Coding agents are increasingly deployed to autonomously maintain software, including to resolve user-reported issues: a bug report comes in and the agent creates a patch to address it. However, in any real-world deployment, they will…

Software Engineering · Computer Science 2026-05-11 Thibaud Gloaguen , Niels Mündler , Mark Müller , Veselin Raychev , Martin Vechev

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo , Minh Chien Vu , Jenny Chim , Han Hu , Wenhao Yu , Ratnadira Widyasari , Imam Nur Bani Yusuf , Haolan Zhan , Junda He , Indraneil Paul , Simon Brunner , Chen Gong , Thong Hoang , Armel Randy Zebaze , Xiaoheng Hong , Wen-Ding Li , Jean Kaddour , Ming Xu , Zhihan Zhang , Prateek Yadav , Naman Jain , Alex Gu , Zhoujun Cheng , Jiawei Liu , Qian Liu , Zijian Wang , Binyuan Hui , Niklas Muennighoff , David Lo , Daniel Fried , Xiaoning Du , Harm de Vries , Leandro Von Werra

RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform

Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of…

Software Engineering · Computer Science 2026-03-06 Kenan Li , Rongzhi Li , Linghao Zhang , Qirui Jin , Liao Zhu , Xiaosong Huang , Geng Zhang , Yikai Zhang , Shilin He , Chengxing Xie , Xin Zhang , Zijian Jin , Bowen Li , Chaoyun Zhang , Yu Kang , Yufan Huang , Elsie Nallipogu , Saravan Rajmohan , Qingwei Lin , Dongmei Zhang

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing , Wenbin Hu , Haochen Shi , Hanyu Yang , Sirui Zhang , Shaojin Chen , Haoran Li , Yangqiu Song

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…

Artificial Intelligence · Computer Science 2025-05-13 Kai Xu , YiWei Mao , XinYi Guan , ZiLong Feng