Related papers: FreshBrew: A Benchmark for Evaluating AI Agents on…

JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)

We build a benchmark to evaluate large language models (LLMs) for source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but…

Software Engineering · Computer Science 2026-02-11 Nishil Amin, Zhiwei Fei, Xiang Li, Justyna Petke, He Ye

Evaluating LLM Agents on Automated Software Analysis Tasks

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no…

Software Engineering · Computer Science 2026-04-20 Islem Bouzenia, Cristian Cadar, Michael Pradel

Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents

The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and…

Software Engineering · Computer Science 2025-11-14 Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, Gagan Somashekar, Shachee Mishra Gupta, Arun Iyer, Yu Kang, Chetan Bansal, Aditya Akella, Saravan Rajmohan

MigrationBench: Repository-Level Code Migration Benchmark from Java 8

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark…

Software Engineering · Computer Science 2026-05-29 Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras

Can AI Agents Design and Implement Drug Discovery Pipelines?

The rapid advancement of artificial intelligence, particularly autonomous agentic systems based on Large Language Models (LLMs), presents new opportunities to accelerate drug discovery by improving in-silico modeling and reducing dependence…

Artificial Intelligence · Computer Science 2025-04-29 Khachik Smbatyan, Tsolak Ghukasyan, Tigran Aghajanyan, Hovhannes Dabaghyan, Sergey Adamyan, Aram Bughdaryan, Vahagn Altunyan, Gagik Navasardyan, Aram Davtyan, Anush Hakobyan, Aram Gharibyan, Arman Fahradyan, Artur Hakobyan, Hasmik Mnatsakanyan, Narek Ginoyan, Garik Petrosyan

Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous…

Databases · Computer Science 2026-03-24 Ruiying Ma, Shreya Shankar, Ruiqi Chen, Yiming Lin, Sepanta Zeighami, Rajoshi Ghosh, Abhinav Gupta, Anushrut Gupta, Tanmai Gopal, Aditya G. Parameswaran

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…

Software Engineering · Computer Science 2025-05-29 Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark…

Artificial Intelligence · Computer Science 2026-03-04 Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate…

Software Engineering · Computer Science 2026-03-17 Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, Qian Zhang, Viktor Prasanna, Xiangru Tang, Xingyao Wang

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack…

Software Engineering · Computer Science 2026-02-27 Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, Antoine Raux

Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges with observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the…

Artificial Intelligence · Computer Science 2025-03-11 Dany Moshkovich, Hadar Mulian, Sergey Zeltyn, Natti Eder, Inna Skarbovsky, Roy Abitbol

ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides…

Artificial Intelligence · Computer Science 2026-02-10 Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300…

Artificial Intelligence · Computer Science 2026-04-21 Bhaskar Gurram

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen, Chengze Li, Jia Li

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution. Agentic LLM performance varies due to differences in models,…

Software Engineering · Computer Science 2026-04-29 Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar…

Computer Vision and Pattern Recognition · Computer Science 2026-05-06 Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Hannah Yao, Charles Chen, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations

Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as…

Artificial Intelligence · Computer Science 2025-11-12 JV Roig

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and…

Software Engineering · Computer Science 2024-12-30 Zhi Chen, Lingxiao Jiang