RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Chunyu Miao; Henry Peng Zou; Yangning Li; Yankai Chen; Yibo Wang; Fangxin Wang; Yifan Li; Wooseong Yang; Bowei He; Xinni Zhang; Dianzhi Yu; Hanchen Yang; Hoang H Nguyen; Yue Zhou; Jie Yang; Jizhou Guo; Wenzhe Fan; Chin-Yuan Yeh; Panpan Meng; Liancheng Fang; Jinhu Qi; Wei-Chieh Huang; Zhengyao Gu; Yuwei Han; Langzhou He; Yuyao Yang; Yinghui Li; Hai-Tao Zheng; Xue Liu; Irwin King; Philip S. Yu

Computation and Language 2025-10-27 v2 Artificial Intelligence

Abstract

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing work largely adopts one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.

Keywords

code generation; large language model evaluation; large language model

Cite

@article{arxiv.2510.06186,
  title  = {RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback},
  author = {Chunyu Miao and Henry Peng Zou and Yangning Li and Yankai Chen and Yibo Wang and Fangxin Wang and Yifan Li and Wooseong Yang and Bowei He and Xinni Zhang and Dianzhi Yu and Hanchen Yang and Hoang H Nguyen and Yue Zhou and Jie Yang and Jizhou Guo and Wenzhe Fan and Chin-Yuan Yeh and Panpan Meng and Liancheng Fang and Jinhu Qi and Wei-Chieh Huang and Zhengyao Gu and Yuwei Han and Langzhou He and Yuyao Yang and Yinghui Li and Hai-Tao Zheng and Xue Liu and Irwin King and Philip S. Yu},
  journal= {arXiv preprint arXiv:2510.06186},
  year   = {2025}
}

Comments

Code and dataset are available at github.com/ChunyuMiao98/RECODE
