Related papers: ExecRepoBench: Multi-level Executable Code Complet…

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment…

Computation and Language · Computer Science 2023-10-05 Tianyang Liu, Canwen Xu, Julian McAuley

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion at the function and class level by…

Software Engineering · Computer Science 2025-11-03 Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, Xia Liu, Ping Yang

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in…

Computation and Language · Computer Science 2023-10-23 Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories

This paper introduces SecRepoBench, a benchmark to evaluate code agents on secure code completion in real-world repositories. SecRepoBench has 318 code completion tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 29 standalone…

Cryptography and Security · Computer Science 2026-02-17 Chihao Shen, Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen

Evaluating and Achieving Controllable Code Completion in Code LLM

Code completion has become a central task, gaining significant attention with the rise of large language model (LLM)-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities,…

Software Engineering · Computer Science 2026-01-23 Jiajun Zhang, Zeyu Cui, Lei Zhang, Jian Yang, Jiaxi Yang, Qiang Liu, Zilei Wang, Binyuan Hui, Liang Wang, Junyang Lin

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

RepoTransBench: A Real-World Multilingual Benchmark for Repository-Level Code Translation

Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the…

Software Engineering · Computer Science 2025-12-17 Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

On the Impacts of Contexts on Repository-Level Code Generation

CodeLLMs have gained widespread adoption for code generation tasks, yet their capacity to handle repository-level code generation with complex contextual dependencies remains underexplored. Our work underscores the critical importance of…

Software Engineering · Computer Science 2025-02-11 Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed.…

Computation and Language · Computer Science 2025-09-05 Ken Deng, Jiaheng Liu, He Zhu, Congnan Liu, Jingxin Li, Jiakai Wang, Peng Zhao, Chenchen Zhang, Yanan Wu, Xueqiao Yin, Yuanxing Zhang, Zizheng Zhan, Wenbo Su, Bangyu Xiang, Tiezheng Ge, Bo Zheng

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of…

Computation and Language · Computer Science 2024-10-29 Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng

CoreCodeBench: Decoupling Code Intelligence via Fine-Grained Repository-Level Tasks

The evaluation of Large Language Models (LLMs) for software engineering has shifted towards complex, repository-level tasks. However, existing benchmarks predominantly rely on coarse-grained pass rates that treat programming proficiency as…

Software Engineering · Computer Science 2026-01-08 Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu

Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository

LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes,…

Software Engineering · Computer Science 2024-06-06 Ajinkya Deshpande, Anmol Agarwal, Shashank Shet, Arun Iyer, Aditya Kanade, Ramakrishna Bairi, Suresh Parthasarathy

MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion benchmark

Large language models (LLMs) have achieved strong performance on code completion tasks in general-purpose programming languages. However, existing repository-level code completion benchmarks focus almost exclusively on software code and…

Programming Languages · Computer Science 2026-02-03 Qingyun Zou, Jiahao Cui, Nuo Chen, Bingsheng He, Weng-Fai Wong

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions,…

Software Engineering · Computer Science 2026-04-15 Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper…

Computation and Language · Computer Science 2024-04-02 Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is…

Software Engineering · Computer Science 2024-10-04 Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose

McEval: Massively Multilingual Code Evaluation

Code large language models (LLMs) have shown remarkable advances in code understanding, completion, and generation tasks. Programming benchmarks, comprised of a selection of code challenges and corresponding test cases, serve as a standard…

Programming Languages · Computer Science 2024-06-12 Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, Zhoujun Li

CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Software Engineering · Computer Science 2025-04-30 Wenjing Yin, Tianze Sun, Yijiong Yu, Jiawei Fang, Guangyao Su, Jiancheng Wang, Zekun Wang, Wei Wang, Ran Chen, Ziyun Dai, Shuai Yuan, Menghang Dong, Peng Luo, Dong Cao, Da Lei, Yajun Zhang, Hao Chen, Xiang Ma, Yong Liu, Weifeng Liu, Yuanjian Xu, Ji Pei

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia