SWE Context Bench: A Benchmark for Context Learning in Coding
Abstract
Large language models are increasingly used as coding agents for software engineering tasks. Current benchmarks mainly evaluate whether an agent can correctly resolve a request or fix a bug. They largely treat tasks as independent and do not assess whether agents can reuse previous experience across related problems. As a result, the efficiency gains from reusing previous experience remain difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate context understanding and retrieval in coding agents. SWE-ContextBench consists of 1,100 base tasks and 376 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests. It groups base tasks and related tasks that share context, spanning 51 unique repositories and 9 programming languages. The benchmark evaluates how accurately and efficiently agents solve related issues when prior cases are available in context. Using SWE-ContextBench, we study the behavior of multiple coding agents across varying context reuse settings and retrieval strategies. Our results show that accurately summarized and retrieved previous experience can significantly improve resolution accuracy and reduce runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected context provides limited or even negative benefit. These findings highlight the importance of context management and retrieval accuracy, and position SWE-ContextBench as a principled benchmark for studying context learning in coding agents.
Cite
@article{arxiv.2602.08316,
title = {SWE Context Bench: A Benchmark for Context Learning in Coding},
author = {Jiayuan Zhu and Junde Wu and Minhao Hu and Shengda Zhu and Jiazhen Pan and Weixiang Shen and Yijun Yang and Fenglin Liu and Jianye Hao and Yueming Jin and Qirong Ho and Min Xu},
journal = {arXiv preprint arXiv:2602.08316},
year = {2026}
}