Related papers: Beyond Isolated Tasks: A Framework for Evaluating …

200 papers

Automating real-world software engineering tasks remains challenging for large language model (LLM)-based agents due to the need for long-horizon reasoning over large, evolving codebases and making consistent decisions across interdependent…

Software Engineering · Computer Science 2026-04-14 Mahir Labib Dihan, Md Ashrafur Rahman Khan

As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation…

Software Engineering · Computer Science 2026-05-14 Hao Guan, Lingyue Fu, Shao Zhang, Yaoming Zhu, Kangning Zhang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, which is a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate agents'…

Software Engineering · Computer Science 2026-02-03 Shojibur Rahman, Md Fazle Rabbi, Minhaz Zibran

Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven…

Software Engineering · Computer Science 2026-01-08 Sabrina Haque, Sarvesh Ingale, Christoph Csallner

Autonomous coding agents are increasingly deployed as AI teammates in modern software engineering, independently authoring pull requests (PRs) that modify production code at scale. This study aims to systematically characterize how…

Cryptography and Security · Computer Science 2026-01-05 Mohammed Latif Siddiq, Xinye Zhao, Vinicius Carvalho Lopes, Beatrice Casey, Joanna C. S. Santos

Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on…

Software Engineering · Computer Science 2025-11-04 Zhuowen Yin, Cuifeng Gao, Chunsong Fan, Wenzhang Yang, Yinxing Xue, Lijun Zhang

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that…

Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains…

Software Engineering · Computer Science 2026-02-10 Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng cheng, Yeyun Gong

Executable software engineering data is valuable for training SWE agents, but scaling it remains difficult for two reasons: only a small fraction of real repository changes yield verifiable, high-signal task instances, and naively building…

Software Engineering · Computer Science 2026-03-24 Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, Wenhu Chen

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely…

Software Engineering · Computer Science 2026-05-15 Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang, Terry Yue Zhuo, Michael R. Lyu

Although coding agents have introduced new coordination dynamics in collaborative software development, detailed interactions in practice remain underexplored, especially for the code review process. In this study, we mine agent-authored PR…

Software Engineering · Computer Science 2026-04-07 Islem Khemissi, Moataz Chouchen, Dong Wang, Raula Gaikovina Kula

As AI coding agents evolve from autocomplete tools to autonomous "AI workforce" teammates, they introduce a critical new bottleneck: human maintainers must now manage complex interaction loops rather than just reviewing code. Analyzing…

Prior works on training software engineering agents have explored utilizing existing resources such as issues on GitHub repositories to construct software engineering tasks and corresponding test suites. These approaches face two key…

Software Engineering · Computer Science 2026-01-13 Yiqi Zhu, Apurva Gandhi, Graham Neubig

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

AI coding agents are increasingly acting as autonomous contributors by generating and submitting pull requests (PRs). However, we lack empirical evidence on how these agent-generated PRs differ from human contributions, particularly in how…

Software Engineering · Computer Science 2026-04-07 Daniel Ogenrwot, John Businge

Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce…

Software Engineering · Computer Science 2026-01-30 Haoming Huang, Pongchai Jaisri, Shota Shimizu, Lingfeng Chen, Sota Nakashima, Gema Rodríguez-Pérez

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

The increasing adoption of AI coding agents has increased the number of agent-generated pull requests (PRs) merged with little or no human intervention. Although such PRs promise productivity gains, their post-merge code quality remains…

Software Engineering · Computer Science 2026-01-29 Shamse Tasnim Cynthia, Al Muttakin, Banani Roy

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE…

Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally…

Software Engineering · Computer Science 2026-04-17 Hao Han, Jin Xie, Xuehao Ma, Weiquan Zhu, Ziyao Zhang, ZhiLiang Long, Hongkai Chen, Qingwen Ye