English
Related papers

Related papers: FreshBrew: A Benchmark for Evaluating AI Agents on…

200 papers

We build a benchmark to evaluate large language models (LLMs) for source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but…

Software Engineering · Computer Science 2026-02-11 Nishil Amin , Zhiwei Fei , Xiang Li , Justyna Petke , He Ye

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no…

Software Engineering · Computer Science 2026-04-20 Islem Bouzenia , Cristian Cadar , Michael Pradel

The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and…

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark…

Software Engineering · Computer Science 2026-05-29 Linbo Liu , Xinle Liu , Qiang Zhou , Lin Chen , Yihan Liu , Hoan Nguyen , Behrooz Omidvar-Tehrani , Xi Shen , Jun Huan , Omer Tripp , Anoop Deoras

The rapid advancement of artificial intelligence, particularly autonomous agentic systems based on Large Language Models (LLMs), presents new opportunities to accelerate drug discovery by improving in-silico modeling and reducing dependence…

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous…

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…

Software Engineering · Computer Science 2025-05-29 Tobias Lindenbauer , Egor Bogomolov , Yaroslav Zharov

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark…

With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve customized software to enable interaction within dynamic environments. Yet, existing benchmarks evaluate…

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack…

Software Engineering · Computer Science 2026-02-27 Xuefeng Li , Nir Ben-Israel , Yotam Raz , Belal Ahmed , Doron Serebro , Antoine Raux

The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges with observing, analyzing and optimizing their behavior. Traditional evaluation and benchmarking approaches struggle to handle the…

Artificial Intelligence · Computer Science 2025-03-11 Dany Moshkovich , Hadar Mulian , Sergey Zeltyn , Natti Eder , Inna Skarbovsky , Roy Abitbol

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides…

Artificial Intelligence · Computer Science 2026-02-10 Pengrui Lu , Shiqi Zhang , Yunzhong Hou , Lyumanshan Ye , Chaoyi Huang , Zixi Chen , Ji Zeng , Hantao Jiang , Pengfei Liu , Yiwei Wang , Ming-Hsuan Yang

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300…

Artificial Intelligence · Computer Science 2026-04-21 Bhaskar Gurram

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution. Agentic LLM performance varies due to differences in models,…

Software Engineering · Computer Science 2026-04-29 Hubert M. Pysklo , Artem Zhuravel , Patrick D. Watson

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar…

Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as…

Artificial Intelligence · Computer Science 2025-11-12 JV Roig

In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and…

Software Engineering · Computer Science 2024-12-30 Zhi Chen , Lingxiao Jiang
‹ Prev 1 2 3 10 Next ›