EpiCoder: Encompassing Diversity and Complexity in Code Generation

Computation and Language; Artificial Intelligence. 2025-10-10 (v3)

Abstract

Existing methods for code generation use code snippets as seed data, which restricts the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework built around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionality that ranges from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, which achieves state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential for synthesizing repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
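
The idea of controlling complexity through the depth and breadth of sampled subtrees can be illustrated with a minimal sketch. This is not the authors' implementation: the `FeatureNode` class, the `sample_subtree` function, and the toy feature names are all hypothetical, chosen only to show how two knobs (`max_depth`, `max_breadth`) could bound the size of a sampled feature set.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNode:
    """One node of a hypothetical feature tree (name plus child features)."""
    name: str
    children: List["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Copy `node`, keeping at most `max_breadth` randomly chosen children
    per node and descending at most `max_depth` levels below `node`."""
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def depth(node):
    """Number of edges on the longest path from `node` to a leaf."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def count_nodes(node):
    """Total number of nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


# Toy tree with made-up feature names (illustrative only).
TREE = FeatureNode("code", [
    FeatureNode("data processing", [
        FeatureNode("parsing", [FeatureNode("json"), FeatureNode("csv")]),
        FeatureNode("validation"),
    ]),
    FeatureNode("file operations", [
        FeatureNode("read"),
        FeatureNode("write"),
    ]),
    FeatureNode("networking", [
        FeatureNode("http client"),
        FeatureNode("sockets"),
    ]),
])

rng = random.Random(0)
small = sample_subtree(TREE, max_depth=1, max_breadth=2, rng=rng)
large = sample_subtree(TREE, max_depth=3, max_breadth=3, rng=rng)
print(count_nodes(small), count_nodes(large))
```

In this sketch, a shallow and narrow sample yields a handful of features, roughly what a single function might need, while a deeper and wider sample yields a larger feature set of the kind a multi-file task might draw on.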

Cite

@article{arxiv.2501.04694,
  title  = {EpiCoder: Encompassing Diversity and Complexity in Code Generation},
  author = {Yaoxiang Wang and Haoling Li and Xin Zhang and Jie Wu and Xiao Liu and Wenxiang Hu and Zhongxin Guo and Yangyu Huang and Ying Xin and Yujiu Yang and Jinsong Su and Qi Chen and Scarlett Li},
  journal= {arXiv preprint arXiv:2501.04694},
  year   = {2025}
}

Comments

ICML 2025
