Related papers: On Pretraining for Project-Level Code Completion

RepoFusion: Training Code Models to Understand Your Repository

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.),…

Machine Learning · Computer Science 2023-06-21 Disha Shrivastava , Denis Kocetkov , Harm de Vries , Dzmitry Bahdanau , Torsten Scholak

Data Engineering for Scaling Language Models to 128K Context

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular \textit{the ability to utilize information at…

Computation and Language · Computer Science 2024-02-16 Yao Fu , Rameswar Panda , Xinyao Niu , Xiang Yue , Hannaneh Hajishirzi , Yoon Kim , Hao Peng

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in…

Computation and Language · Computer Science 2023-10-23 Fengji Zhang , Bei Chen , Yue Zhang , Jacky Keung , Jin Liu , Daoguang Zan , Yi Mao , Jian-Guang Lou , Weizhu Chen

In Line with Context: Repository-Level Code Generation via Context Inlining

Repository-level code generation has attracted growing attention in recent years. Unlike function-level code generation, it requires the model to understand the entire repository, reasoning over complex dependencies across functions,…

Software Engineering · Computer Science 2026-05-07 Chao Hu , Wenhao Zeng , Yuling Shi , Beijun Shen , Xiaodong Gu

CATCODER: Repository-Level Code Generation with Relevant Code and Type Context

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, repository-level code generation presents unique challenges, particularly due to the need to utilize information spread across…

Software Engineering · Computer Science 2025-11-24 Zhiyuan Pan , Xing Hu , Xin Xia , Xiaohu Yang

R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models

Code completion models have made significant progress in recent years. Recently, repository-level code completion has drawn more attention in modern software development, and several baseline methods and benchmarks have been proposed.…

Computation and Language · Computer Science 2025-09-05 Ken Deng , Jiaheng Liu , He Zhu , Congnan Liu , Jingxin Li , Jiakai Wang , Peng Zhao , Chenchen Zhang , Yanan Wu , Xueqiao Yin , Yuanxing Zhang , Zizheng Zhan , Wenbo Su , Bangyu Xiang , Tiezheng Ge , Bo Zheng

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce…

Software Engineering · Computer Science 2024-01-29 Daya Guo , Qihao Zhu , Dejian Yang , Zhenda Xie , Kai Dong , Wentao Zhang , Guanting Chen , Xiao Bi , Y. Wu , Y. K. Li , Fuli Luo , Yingfei Xiong , Wenfeng Liang

Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study

Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks including automatic code completion which is a staple in a developer's toolkit. While many have striven to improve the…

Computation and Language · Computer Science 2023-04-25 Tim van Dam , Maliheh Izadi , Arie van Deursen

GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model

The performance of repository-level code completion depends upon the effective leverage of both general and repository-specific knowledge. Despite the impressive capability of code LLMs in general code completion tasks, they often exhibit…

Software Engineering · Computer Science 2024-09-16 Wei Liu , Ailun Yu , Daoguang Zan , Bo Shen , Wei Zhang , Haiyan Zhao , Zhi Jin , Qianxiang Wang

ExecRepoBench: Multi-level Executable Code Completion Evaluation

Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant…

Computation and Language · Computer Science 2024-12-17 Jian Yang , Jiajun Zhang , Jiaxi Yang , Ke Jin , Lei Zhang , Qiyao Peng , Ken Deng , Yibo Miao , Tianyu Liu , Zeyu Cui , Binyuan Hui , Junyang Lin

aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion

Large Language Models (LLMs) have shown promising results in repository-level code completion, which completes code based on the in-file and cross-file context of a repository. The cross-file context typically contains different types of…

Software Engineering · Computer Science 2026-02-10 Jia Li , Hao Zhu , Huanyu Liu , Xianjie Shi , He Zong , Yihong Dong , Kechi Zhang , Siyuan Jiang , Zhi Jin , Ge Li

A Review of Repository Level Prompting for LLMs

As coding challenges become more complex, recent advancements in Large Language Models (LLMs) have led to notable successes, such as achieving a 94.6\% solve rate on the HumanEval benchmark. Concurrently, there is an increasing commercial…

Software Engineering · Computer Science 2023-12-19 Douglas Schonholtz

On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing

Recent advancements in code-fluent Large Language Models (LLMs) enabled the research on repository-level code editing. In such tasks, the model navigates and modifies the entire codebase of a project according to request. Hence, such tasks…

Software Engineering · Computer Science 2024-06-10 Alexander Kovrigin , Aleksandra Eliseeva , Yaroslav Zharov , Timofey Bryksin

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation…

Software Engineering · Computer Science 2026-01-28 Tianyue Jiang , Yanli Wang , Yanlin Wang , Daya Guo , Ensheng Shi , Yuchi Ma , Jiachi Chen , Zibin Zheng

Multi-task Learning based Pre-trained Language Model for Code Completion

Code completion is one of the most useful features in the Integrated Development Environments (IDEs), which can accelerate software development by suggesting the next probable token based on the contextual code in real-time. Recent studies…

Software Engineering · Computer Science 2021-01-01 Fang Liu , Ge Li , Yunfei Zhao , Zhi Jin

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather…

Computation and Language · Computer Science 2026-05-25 Miao Li , Irina Saparina , Alexander Gurung , Mirella Lapata

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences.…

Computation and Language · Computer Science 2024-09-05 Zhiyuan Hu , Yuliang Liu , Jinman Zhao , Suyuchen Wang , Yan Wang , Wei Shen , Qing Gu , Anh Tuan Luu , See-Kiong Ng , Zhiwei Jiang , Bryan Hooi

In-context Pretraining: Language Modeling Beyond Document Boundaries

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining…

Computation and Language · Computer Science 2024-06-25 Weijia Shi , Sewon Min , Maria Lomeli , Chunting Zhou , Margaret Li , Gergely Szilvasy , Rich James , Xi Victoria Lin , Noah A. Smith , Luke Zettlemoyer , Scott Yih , Mike Lewis

MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion

Code completion is a valuable topic in both academia and industry. Recently, large-scale mono-programming-lingual (MonoPL) pre-training models have been proposed to boost the performance of code completion. However, the code completion on…

Computation and Language · Computer Science 2022-12-20 Zi Gong , Yinpeng Guo , Pingyi Zhou , Cuiyun Gao , Yasheng Wang , Zenglin Xu

Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code…

Computation and Language · Computer Science 2024-06-28 Lei Zhang , Yunshui Li , Jiaming Li , Xiaobo Xia , Jiaxi Yang , Run Luo , Minzheng Wang , Longze Chen , Junhao Liu , Min Yang