Related papers: A Time-Consistent Benchmark for Repository-Level So…

Toward a Benchmark Repository for Software Maintenance Tool Evaluations with Humans

To evaluate software maintenance techniques and tools in controlled experiments with human participants, researchers currently use projects and tasks selected on an ad-hoc basis. This can unrealistically favor their tool, and it makes the…

Software Engineering · Computer Science 2020-12-01 Matúš Sulír

Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Code agents are currently having skillful performance on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the…

Software Engineering · Computer Science 2026-05-27 Hanyu Li, Yichi Zhang, Speed Zhu, Hang Su, Jun Zhu, Yinpeng Dong

RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack…

Software Engineering · Computer Science 2026-02-27 Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, Antoine Raux

CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and…

Software Engineering · Computer Science 2024-12-20 Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

RepoTransBench: A Real-World Multilingual Benchmark for Repository-Level Code Translation

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the…

Software Engineering · Computer Science 2025-12-17 Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a…

Software Engineering · Computer Science 2026-04-09 Xinchen Wang, Ruida Hu, Cuiyun Gao, Pengfei Gao, Chao Peng

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by…

Software Engineering · Computer Science 2025-11-03 Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, Xia Liu, Ping Yang

TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of…

Software Engineering · Computer Science 2026-04-29 Ryo Fujii, Makoto Morishita, Kazuki Yano, Jun Suzuki

On the Time-Based Conclusion Stability of Cross-Project Defect Prediction Models

Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's…

Software Engineering · Computer Science 2020-08-10 Abdul Ali Bangash, Hareem Sahar, Abram Hindle, Karim Ali

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories…

Software Engineering · Computer Science 2026-03-18 Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed.…

Computation and Language · Computer Science 2025-09-05 Ken Deng, Jiaheng Liu, He Zhu, Congnan Liu, Jingxin Li, Jiakai Wang, Peng Zhao, Chenchen Zhang, Yanan Wu, Xueqiao Yin, Yuanxing Zhang, Zizheng Zhan, Wenbo Su, Bangyu Xiang, Tiezheng Ge, Bo Zheng

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations

Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a…

Software Engineering · Computer Science 2026-01-06 Alexander Korn, Lea Zaruchas, Chetan Arora, Andreas Metzger, Sven Smolka, Fanyu Wang, Andreas Vogelsang

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two…

Software Engineering · Computer Science 2026-05-29 Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun, Yehua Yang, Qiao Zhao

CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering

Large language models that enhance software development tasks, such as code generation, code completion, and code question answering (QA), have been extensively studied in both academia and the industry. The models are integrated into…

Software Engineering · Computer Science 2025-01-08 Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, Shuiguang Deng

REVERE: Reflective Evolving Research Engineer for Scientific Workflows

Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured…

Software Engineering · Computer Science 2026-03-24 Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

Software Engineering · Computer Science 2025-11-12 Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan

Using Temporal and Semantic Developer-Level Information to Predict Maintenance Activity Profiles

Predictive models for software projects' characteristics have been traditionally based on project-level metrics, employing only little developer-level information, or none at all. In this work we suggest novel metrics that capture temporal…

Software Engineering · Computer Science 2016-12-01 Stanislav Levin, Amiram Yehudai