English
Related papers

Related papers: CodeR: Issue Resolving with Multi-Agent and Task G…

200 papers

Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub…

GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large…

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez , John Yang , Alexander Wettig , Shunyu Yao , Kexin Pei , Ofir Press , Karthik Narasimhan

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large…

Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use…

Software Engineering · Computer Science 2024-07-26 Yuntong Zhang , Haifeng Ruan , Zhiyu Fan , Abhik Roychoudhury

AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench~5G, the first benchmark designed to investigate…

Networking and Internet Architecture · Computer Science 2026-04-30 Jiao Chen , Jianhua Tang , Xiaotong Yang , Zuohong Lv

Large language models are increasingly used as coding agents for software engineering tasks. Current benchmarks mainly evaluate whether the agent can correctly solve the request or fix the bugs. They largely treat tasks as independent and…

Software Engineering · Computer Science 2026-05-07 Jiayuan Zhu , Junde Wu , Minhao Hu , Shengda Zhu , Jiazhen Pan , Weixiang Shen , Yijun Yang , Fenglin Liu , Jianye Hao , Yueming Jin , Qirong Ho , Min Xu

SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If…

Software Engineering · Computer Science 2025-12-23 Thanosan Prathifkumar , Noble Saji Mathews , Meiyappan Nagappan

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous,…

Software Engineering · Computer Science 2025-08-01 Han Li , Yuling Shi , Shaoxin Lin , Xiaodong Gu , Heng Lian , Xin Wang , Yantao Jia , Tao Huang , Qianxiang Wang

When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as…

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos , Mark Niklas Müller , Martin Vechev

Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the…

Software Engineering · Computer Science 2025-01-14 Pat Rondon , Renyao Wei , José Cambronero , Jürgen Cito , Aaron Sun , Siddhant Sanyam , Michele Tufano , Satish Chandra

LLMs generate buggy code: 29.6% of SWE-bench solved patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four…

Software Engineering · Computer Science 2025-12-05 Shreshth Rajan

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination…

Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic…

Software Engineering · Computer Science 2025-09-18 Simiao Liu , Fang Liu , Liehao Li , Xin Tan , Yinghao Zhu , Xiaoli Lian , Li Zhang

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We…

Computation and Language · Computer Science 2026-05-27 Guoxin Chen , Fanzhe Meng , Jiale Zhao , Minghao Li , Daixuan Cheng , Huatong Song , Jie Chen , Yuzhi Lin , Hui Chen , Xin Zhao , Ruihua Song , Chang Liu , Cheng Chen , Kai Jia , Ji-Rong Wen

Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on…

Computation and Language · Computer Science 2025-05-08 Chengxing Xie , Bowen Li , Chang Gao , He Du , Wai Lam , Difan Zou , Kai Chen

Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these…

Software Engineering · Computer Science 2026-03-13 Jatin Ganhotra , Sami Serhan , Antonio Abu Nassar , Avraham Shinnar , Ziv Nevo , Martin Hirzel

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in…

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…

Software Engineering · Computer Science 2025-05-29 Tobias Lindenbauer , Egor Bogomolov , Yaroslav Zharov
‹ Prev 1 2 3 10 Next ›