Related papers: A Time-Consistent Benchmark for Repository-Level So…

200 papers

To evaluate software maintenance techniques and tools in controlled experiments with human participants, researchers currently use projects and tasks selected on an ad-hoc basis. This can unrealistically favor their tool, and it makes the…

Software Engineering · Computer Science 2020-12-01 Matúš Sulír

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

Code agents are currently having skillful performance on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the…

Software Engineering · Computer Science 2026-05-27 Hanyu Li, Yichi Zhang, Speed Zhu, Hang Su, Jun Zhu, Yinpeng Dong

The evolution of AI coding agents has shifted the frontier from simple snippet completion to autonomous repository-level engineering. However, evaluating these agents remains ill-posed in general code repository generation, where the lack…

Software Engineering · Computer Science 2026-02-27 Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, Antoine Raux

In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and…

Software Engineering · Computer Science 2024-12-20 Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the…

Software Engineering · Computer Science 2025-12-17 Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a…

Software Engineering · Computer Science 2026-04-09 Xinchen Wang, Ruida Hu, Cuiyun Gao, Pengfei Gao, Chao Peng

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of…

Software Engineering · Computer Science 2026-03-30 Deepak Kumar

With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by…

Software Engineering · Computer Science 2025-11-03 Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, Xia Liu, Ping Yang

With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of…

Software Engineering · Computer Science 2026-04-29 Ryo Fujii, Makoto Morishita, Kazuki Yano, Jun Suzuki

Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's…

Software Engineering · Computer Science 2020-08-10 Abdul Ali Bangash, Hareem Sahar, Abram Hindle, Karim Ali

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories…

Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed.…

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a…

Software Engineering · Computer Science 2026-01-06 Alexander Korn, Lea Zaruchas, Chetan Arora, Andreas Metzger, Sven Smolka, Fanyu Wang, Andreas Vogelsang

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two…

Software Engineering · Computer Science 2026-05-29 Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun, Yehua Yang, Qiao Zhao

Large language models that enhance software development tasks, such as code generation, code completion, and code question answering (QA), have been extensively studied in both academia and the industry. The models are integrated into…

Software Engineering · Computer Science 2025-01-08 Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, Shuiguang Deng

Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured…

Software Engineering · Computer Science 2026-03-24 Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather…

Predictive models for software projects' characteristics have been traditionally based on project-level metrics, employing only little developer-level information, or none at all. In this work we suggest novel metrics that capture temporal…

Software Engineering · Computer Science 2016-12-01 Stanislav Levin, Amiram Yehudai