Related papers: SWE Context Bench: A Benchmark for Context Learnin…

ContextBench: A Benchmark for Context Retrieval in Coding Agents

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during…

Machine Learning · Computer Science 2026-02-12 Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye

Automated Benchmark Generation for Repository-Level Coding Tasks

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos, Mark Niklas Müller, Martin Vechev

SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic…

Software Engineering · Computer Science 2025-11-12 Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang, Wuxuan Gong, Ziyuan Gao, Guanxiang Wang, Yirong Xue, Mengtong Li, Mengfei Xie, Xiaojiang Zhang, Jinghui Wang, Wenhao Zhuang, Zheng Lin, Huiming Wang, Zhaoxiang Zhang, Yuqun Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu

CL4SE: Benchmarking Context Learning on Software Engineering

Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success,…

Software Engineering · Computer Science 2026-04-07 Haichuan Hu, Quanjun Zhang, Ye Shang, Guoqing Xie, Chunrong Fang, Zhenyu Chen, Liang Xiao

SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that…

Software Engineering · Computer Science 2025-11-05 Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Software Engineering · Computer Science 2025-12-22 Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents

Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We…

Software Engineering · Computer Science 2025-04-25 Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, Laurent Callot

SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase.…

Software Engineering · Computer Science 2026-02-11 Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You

Does SWE-Bench-Verified Test Agent Ability or Model Memory?

SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If…

Software Engineering · Computer Science 2025-12-23 Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

Software Engineering · Computer Science 2025-11-12 Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution…

Software Engineering · Computer Science 2026-03-02 Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

SWE-Edit: Rethinking Code Editing for Efficient SWE-Agent

Large language model agents have made strong progress on software engineering, yet current systems suffer from a context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit…

Software Engineering · Computer Science 2026-05-27 Yikai Zhang, Jiaxin Pei, Kenan Li, Qirui Jin, Maoquan Wang, Jin Pan, Yu Kang, Shengyu Fu, Elsie Nallipogu, Junjie Hu, Yufan Huang, Zijian Jin

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

SWE-bench Goes Live!

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in…

Software Engineering · Computer Science 2025-06-03 Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang

SWE-Bench-CL: Continual Learning for Coding Agents

Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce…

Machine Learning · Computer Science 2025-07-02 Thomas Joshi, Shayan Chowdhury, Fatih Uysal

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. In order to…

Multiagent Systems · Computer Science 2026-05-07 Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks

AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and for other languages like Java and C in benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in…

Software Engineering · Computer Science 2025-11-19 Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context…

Software Engineering · Computer Science 2026-05-27 Kang He, Kaushik Roy