Related papers: SWE Context Bench: A Benchmark for Context Learnin…

200 papers

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during…

Machine Learning · Computer Science 2026-02-12 Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos, Mark Niklas Müller, Martin Vechev

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic…

Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success,…

Software Engineering · Computer Science 2026-04-07 Haichuan Hu, Quanjun Zhang, Ye Shang, Guoqing Xie, Chunrong Fang, Zhenyu Chen, Liang Xiao

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that…

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We…

Can large language model agents develop industry-level mobile applications? We introduce SWE-Bench Mobile, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase.…

Software Engineering · Computer Science 2026-02-11 Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You

SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If…

Software Engineering · Computer Science 2025-12-23 Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution…

Software Engineering · Computer Science 2026-03-02 Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

Large language model agents have made strong progress on software engineering, yet current systems suffer from a context coupling problem: the standard code editing interface conflates code inspection, modification planning, and edit…

Software Engineering · Computer Science 2026-05-27 Yikai Zhang, Jiaxin Pei, Kenan Li, Qirui Jin, Maoquan Wang, Jin Pan, Yu Kang, Shengyu Fu, Elsie Nallipogu, Junjie Hu, Yufan Huang, Zijian Jin

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in…

Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce…

Machine Learning · Computer Science 2025-07-02 Thomas Joshi, Shayan Chowdhury, Fatih Uysal

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. In order to…

Multiagent Systems · Computer Science 2026-05-07 Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

AI coding agents have shown great progress on Python software engineering benchmarks like SWE-Bench, and for other languages like Java and C in benchmarks like Multi-SWE-Bench. However, C# -- a prominent enterprise language ranking #5 in…

Software Engineering · Computer Science 2025-11-19 Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, Gustavo Soares

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context…

Software Engineering · Computer Science 2026-05-27 Kang He, Kaushik Roy