AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Zihan Wang; Jiaze Chen; Zhicheng Liu; Markus Mak; Yidi Du; Geonsik Moon; Luoqi Xu; Aaron Tua; Kunshuo Peng; Jiayi Lu; Mingfei Xia; Boqian Zou; Chenyang Ran; Guang Tian; Shoutai Zhu; Yeheng Duan; Zhenghui Kang; Zhenxing Lin; Shangshu Li; Qiang Luo; Qingshen Long; Zhiyong Chen; Yihan Xiao; Yurong Wu; Daoguang Zan; Yuyi Fu; Mingxuan Wang; Ming Ding

Software Engineering; Computation and Language | 2025-08-25 (v1)

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.

Keywords

code generation; large language model evaluation; large language model

Cite

@article{arxiv.2508.16402,
  title  = {AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions},
  author = {Zihan Wang and Jiaze Chen and Zhicheng Liu and Markus Mak and Yidi Du and Geonsik Moon and Luoqi Xu and Aaron Tua and Kunshuo Peng and Jiayi Lu and Mingfei Xia and Boqian Zou and Chenyang Ran and Guang Tian and Shoutai Zhu and Yeheng Duan and Zhenghui Kang and Zhenxing Lin and Shangshu Li and Qiang Luo and Qingshen Long and Zhiyong Chen and Yihan Xiao and Yurong Wu and Daoguang Zan and Yuyi Fu and Mingxuan Wang and Ming Ding},
  journal= {arXiv preprint arXiv:2508.16402},
  year   = {2025}
}

Comments

15 pages
