Related papers: Beyond Isolated Tasks: A Framework for Evaluating …

SWE-Shepherd: Advancing PRMs for Reinforcing Code Agents

Automating real-world software engineering tasks remains challenging for large language model (LLM)-based agents due to the need for long-horizon reasoning over large, evolving codebases and making consistent decisions across interdependent…

Software Engineering · Computer Science 2026-04-14 Mahir Labib Dihan, Md Ashrafur Rahman Khan

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation…

Software Engineering · Computer Science 2026-05-14 Hao Guan, Lingyue Fu, Shao Zhang, Yaoming Zhu, Kangning Zhang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

A Task-Level Evaluation of AI Agents in Open-Source Projects

In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, which is a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate agents'…

Software Engineering · Computer Science 2026-02-03 Shojibur Rahman, Md Fazle Rabbi, Minhaz Zibran

Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests

Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven…

Software Engineering · Computer Science 2026-01-08 Sabrina Haque, Sarvesh Ingale, Christoph Csallner

Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub

Autonomous coding agents are increasingly deployed as AI teammates in modern software engineering, independently authoring pull requests (PRs) that modify production code at scale. This study aims to systematically characterize how…

Cryptography and Security · Computer Science 2026-01-05 Mohammed Latif Siddiq, Xinye Zhao, Vinicius Carvalho Lopes, Beatrice Casey, Joanna C. S. Santos

A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks

Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on…

Software Engineering · Computer Science 2025-11-04 Zhuowen Yin, Cuifeng Gao, Chunsong Fan, Wenzhang Yang, Yinxing Xue, Lijun Zhang

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that…

Software Engineering · Computer Science 2025-11-05 Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel

Pull Requests as a Training Signal for Repo-Level Code Editing

Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains…

Software Engineering · Computer Science 2026-02-10 Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng cheng, Yeyun Gong

SWE-Next: Scalable Real-World Software Engineering Tasks for Agents

Executable software engineering data is valuable for training SWE agents, but scaling it remains difficult for two reasons: only a small fraction of real repository changes yield verifiable, high-signal task instances, and naively building…

Software Engineering · Computer Science 2026-03-24 Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, Wenhu Chen

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely…

Software Engineering · Computer Science 2026-05-15 Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang, Terry Yue Zhuo, Michael R. Lyu

Humans Integrate, Agents Fix: How Agent-Authored Pull Requests Are Referenced in Practice

Although coding agents have introduced new coordination dynamics in collaborative software development, detailed interactions in practice remain underexplored, especially for the code review process. In this study, we mine agent-authored PR…

Software Engineering · Computer Science 2026-04-07 Islem Khemissi, Moataz Chouchen, Dong Wang, Raula Gaikovina Kula

Early-Stage Prediction of Review Effort in AI-Generated Pull Requests

As AI coding agents evolve from autocomplete tools to autonomous "AI workforce" teammates, they introduce a critical new bottleneck: human maintainers must now manage complex interaction loops rather than just reviewing code. Analyzing…

Software Engineering · Computer Science 2026-01-28 Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Nguyen Dinh Ha Duong, Truong Bao Tran

Training Versatile Coding Agents in Synthetic Environments

Prior works on training software engineering agents have explored utilizing existing resources such as issues on GitHub repositories to construct software engineering tasks and corresponding test suites. These approaches face two key…

Software Engineering · Computer Science 2026-01-13 Yiqi Zhu, Apurva Gandhi, Graham Neubig

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests

AI coding agents are increasingly acting as autonomous contributors by generating and submitting pull requests (PRs). However, we lack empirical evidence on how these agent-generated PRs differ from human contributions, particularly in how…

Software Engineering · Computer Science 2026-04-07 Daniel Ogenrwot, John Businge

More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce…

Software Engineering · Computer Science 2026-01-30 Haoming Huang, Pongchai Jaisri, Shota Shimizu, Lingfeng Chen, Sota Nakashima, Gema Rodríguez-Pérez

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

Software Engineering · Computer Science 2025-11-12 Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests

The increasing adoption of AI coding agents has increased the number of agent-generated pull requests (PRs) merged with little or no human intervention. Although such PRs promise productivity gains, their post-merge code quality remains…

Software Engineering · Computer Science 2026-01-29 Shamse Tasnim Cynthia, Al Muttakin, Banani Roy

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE…

Machine Learning · Computer Science 2026-05-12 Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, MohammadHossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He

SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally…

Software Engineering · Computer Science 2026-04-17 Hao Han, Jin Xie, Xuehao Ma, Weiquan Zhu, Ziyao Zhang, ZhiLiang Long, Hongkai Chen, Qingwen Ye