EpiCoder: Encompassing Diversity and Complexity in Code Generation

Yaoxiang Wang; Haoling Li; Xin Zhang; Jie Wu; Xiao Liu; Wenxiang Hu; Zhongxin Guo; Yangyu Huang; Ying Xin; Yujiu Yang; Jinsong Su; Qi Chen; Scarlett Li

Subjects: Computation and Language; Artificial Intelligence | 2025-10-10 (v3)

Abstract

Existing methods for code generation use code snippets as seed data, which restricts the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework that revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, which achieves state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential for synthesizing repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
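
To make the idea of controlling complexity through subtree depth and breadth concrete, the following is a minimal illustrative sketch, not the authors' implementation. The `FeatureNode` class, the `sample_subtree` function, the uniform random choice of children, and the example feature names are all assumptions introduced here for illustration; the abstract does not specify how sampling is actually performed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FeatureNode:
    """One node of a hypothetical feature tree (illustrative only)."""
    name: str
    children: list["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Return a random subtree rooted at `node`.

    max_depth:   number of levels kept below the root
    max_breadth: maximum number of children kept per node
    Children are chosen uniformly at random (an assumption of this sketch).
    """
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in chosen],
    )


def count_features(node):
    """Total number of nodes in the (sub)tree."""
    return 1 + sum(count_features(c) for c in node.children)


def depth(node):
    """Number of levels below the root."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


# A made-up feature tree; the feature names are placeholders.
tree = FeatureNode("root", [
    FeatureNode("data processing", [FeatureNode("parsing"), FeatureNode("validation")]),
    FeatureNode("file operations", [FeatureNode("read"), FeatureNode("write")]),
    FeatureNode("networking", [FeatureNode("http request")]),
])

rng = random.Random(0)
small = sample_subtree(tree, max_depth=1, max_breadth=1, rng=rng)  # few features
large = sample_subtree(tree, max_depth=2, max_breadth=2, rng=rng)  # more features
```

In this sketch, a shallow and narrow sample yields a handful of features that might describe a single function, while a deeper and wider sample yields a larger feature set that could describe a multi-file task. How the sampled features are turned into a generation prompt is outside the scope of this example.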

Keywords

code generation, program analysis

Cite

@article{arxiv.2501.04694,
  title  = {EpiCoder: Encompassing Diversity and Complexity in Code Generation},
  author = {Yaoxiang Wang and Haoling Li and Xin Zhang and Jie Wu and Xiao Liu and Wenxiang Hu and Zhongxin Guo and Yangyu Huang and Ying Xin and Yujiu Yang and Jinsong Su and Qi Chen and Scarlett Li},
  journal= {arXiv preprint arXiv:2501.04694},
  year   = {2025}
}

Comments

ICML 2025
