Related papers: CodeR: Issue Resolving with Multi-Agent and Task G…

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub…

Computation and Language · Computer Science 2024-10-08 John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large…

Software Engineering · Computer Science 2024-08-27 Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large…

Software Engineering · Computer Science 2025-04-04 Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang

AutoCodeRover: Autonomous Program Improvement

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use…

Software Engineering · Computer Science 2024-07-26 Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks

AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench 5G, the first benchmark designed to investigate…

Networking and Internet Architecture · Computer Science 2026-04-30 Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv

SWE Context Bench: A Benchmark for Context Learning in Coding

Large language models are increasingly used as coding agents for software engineering tasks. Current benchmarks mainly evaluate whether the agent can correctly solve the request or fix the bugs. They largely treat tasks as independent and…

Software Engineering · Computer Science 2026-05-07 Jiayuan Zhu, Junde Wu, Minhao Hu, Shengda Zhu, Jiazhen Pan, Weixiang Shen, Yijun Yang, Fenglin Liu, Jianye Hao, Yueming Jin, Qirong Ho, Min Xu

Does SWE-Bench-Verified Test Agent Ability or Model Memory?

SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If…

Software Engineering · Computer Science 2025-12-23 Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous,…

Software Engineering · Computer Science 2025-08-01 Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang

Hybrid-Gym: Training Coding Agents to Generalize Across Tasks

When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as…

Software Engineering · Computer Science 2026-02-20 Yiqing Xie, Emmy Liu, Gaokai Zhang, Nachiket Kotalwar, Shubham Gandhi, Sathwik Acharya, Xingyao Wang, Carolyn Rose, Graham Neubig, Daniel Fried

Automated Benchmark Generation for Repository-Level Coding Tasks

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos, Mark Niklas Müller, Martin Vechev

Evaluating Agent-based Program Repair at Google

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the…

Software Engineering · Computer Science 2025-01-14 Pat Rondon, Renyao Wei, José Cambronero, Jürgen Cito, Aaron Sun, Siddhant Sanyam, Michele Tufano, Satish Chandra

Multi-Agent Code Verification via Information Theory

LLMs generate buggy code: 29.6% of SWE-bench solved patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four…

Software Engineering · Computer Science 2025-12-05 Shreshth Rajan

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination…

Software Engineering · Computer Science 2025-07-18 Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

An Empirical Study on Failures in Automated Issue Solving

Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic…

Software Engineering · Computer Science 2025-09-18 Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, Li Zhang

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We…

Computation and Language · Computer Science 2026-05-27 Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on…

Computation and Language · Computer Science 2025-05-08 Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen

Resolving Java Code Repository Issues with iSWE Agent

Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these…

Software Engineering · Computer Science 2026-03-13 Jatin Ganhotra, Sami Serhan, Antonio Abu Nassar, Avraham Shinnar, Ziv Nevo, Martin Hirzel

SWE-bench Goes Live!

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in…

Software Engineering · Computer Science 2025-06-03 Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…

Software Engineering · Computer Science 2025-05-29 Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov