Related papers: Evaluating Agent-based Program Repair at Google

An Empirical Study on LLM-based Agents for Automated Bug Fixing

Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically, demonstrating the capability in addressing software defects by engaging in development environment interaction, iterative validation and code…

Software Engineering · Computer Science 2025-10-21 Xiangxin Meng, Zexiong Ma, Pengfei Gao, Chao Peng

Can Agents Fix Agent Issues?

LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are…

Artificial Intelligence · Computer Science 2025-10-27 Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou

SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially…

Software Engineering · Computer Science 2025-12-23 Minh V. T. Pham, Huy N. Phan, Hoang N. Phan, Cuong Le Chi, Tien N. Nguyen, Nghi D. Q. Bui

PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown…

Software Engineering · Computer Science 2025-12-04 Spandan Garg, Roshanak Zilouchian Moghaddam, Neel Sundaresan

SemAgent: A Semantics Aware Program Repair Agent

Large Language Models (LLMs) have shown impressive capabilities in downstream software engineering tasks such as Automated Program Repair (APR). In particular, there has been a lot of research on repository-level issue-resolution benchmarks…

Software Engineering · Computer Science 2025-06-23 Anvith Pabba, Alex Mathai, Anindya Chakraborty, Baishakhi Ray

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks

AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench 5G, the first benchmark designed to investigate…

Networking and Internet Architecture · Computer Science 2026-04-30 Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasoning, planning, and tool use. Despite impressive leaderboard gains on benchmarks such as SWE-bench,…

Software Engineering · Computer Science 2026-05-28 Ira Ceka, Hailie Mitchell, Saurabh Pujar, Luca Buratti, Shyam Ramji, Junfeng Yang, Gail Kaiser, Baishakhi Ray

PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification

Recent research builds various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-bench. Based on how to determine the…

Robotics · Computer Science 2025-06-12 Hongwei Li, Yuheng Tang, Shiqi Wang, Wenbo Guo

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…

Software Engineering · Computer Science 2025-05-29 Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

In recent years, AI-based software engineering has progressed from pre-trained models to advanced agentic workflows, with Software Development Agents representing the next major leap. These agents, capable of reasoning, planning, and…

Software Engineering · Computer Science 2024-12-30 Zhi Chen, Lingxiao Jiang

Agentic Bug Reproduction for Effective Automated Program Repair at Google

Bug reports often lack sufficient detail for developers to reproduce and fix the underlying defects. Bug Reproduction Tests (BRTs), tests that fail when the bug is present and pass when it has been resolved, are crucial for debugging, but…

Software Engineering · Computer Science 2025-03-12 Runxiang Cheng, Michele Tufano, Jürgen Cito, José Cambronero, Pat Rondon, Renyao Wei, Aaron Sun, Satish Chandra

Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems

The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using…

Software Engineering · Computer Science 2026-02-06 Matias Martinez, Xavier Franch

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

Large Language Models (LLMs) have transformed software development and AI applications. While LLMs are designed for text processing, LLM agents extend this capability by enabling autonomous actions, tool use, and multi-step task completion.…

Software Engineering · Computer Science 2026-04-21 Niful Islam, Muhammad Anas Raza, Mohammad Wardat

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH.…

Software Engineering · Computer Science 2025-11-18 Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, Brad Kenstler

Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated…

Software Engineering · Computer Science 2026-01-27 Spandan Garg, Benjamin Steenhoek, Yufan Huang

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely…

Software Engineering · Computer Science 2026-05-15 Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang, Terry Yue Zhuo, Michael R. Lyu

An Empirical Study on Failures in Automated Issue Solving

Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic…

Software Engineering · Computer Science 2025-09-18 Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, Li Zhang

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization,…

Hardware Architecture · Computer Science 2026-05-18 Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios

AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just…

Software Engineering · Computer Science 2026-04-10 Zhi Chen, Wei Ma, Lingxiao Jiang

Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing…

Software Engineering · Computer Science 2025-11-17 Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah