Multi-Turn Code Generation Through Single-Step Rewards

Arnav Kumar Jain; Gonzalo Gonzalez-Pumariega; Wayne Chen; Alexander M Rush; Wenting Zhao; Sanjiban Choudhury


Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Version: v2 (2025-06-30)



Abstract

We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
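
The generator/verifier interplay described in the abstract can be pictured with a small sketch. This is not the authors' implementation and does not use the muCode repository's API: the function `multi_turn_generate`, its `generate`, `score`, and `execute` callables, and the toy stand-ins below are all hypothetical names chosen for illustration. The sketch assumes that at each turn the generator proposes candidates conditioned on earlier code and execution feedback, the verifier picks the highest-scoring candidate, and the loop stops once execution succeeds. It covers only inference-time use, not the iterative training of either model.

```python
from typing import Callable, List, Tuple

# Each history entry pairs a previously chosen program with its execution feedback.
History = List[Tuple[str, str]]


def multi_turn_generate(
    generate: Callable[[str, History], List[str]],  # hypothetical generator
    score: Callable[[str, str], float],             # hypothetical verifier
    execute: Callable[[str], Tuple[bool, str]],     # hypothetical test runner
    prompt: str,
    max_turns: int = 3,
) -> Tuple[str, History]:
    """Pick the verifier's favourite candidate each turn until execution passes."""
    history: History = []
    best = ""
    for _ in range(max_turns):
        candidates = generate(prompt, history)
        # Single-step decision: the verifier scores only the new candidates.
        best = max(candidates, key=lambda code: score(prompt, code))
        passed, feedback = execute(best)
        history.append((best, feedback))
        if passed:
            break
    return best, history


# Toy stand-ins (assumptions, for demonstration only).
def toy_generate(prompt: str, history: History) -> List[str]:
    buggy = "def add(a, b): return a - b"
    fixed = "def add(a, b): return a + b"
    # Without feedback only the buggy program is proposed; with feedback, both.
    return [buggy] if not history else [buggy, fixed]


def toy_score(prompt: str, code: str) -> float:
    return 1.0 if "a + b" in code else 0.0


def toy_execute(code: str) -> Tuple[bool, str]:
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc}"
    if result == 5:
        return True, "all public tests passed"
    return False, f"add(2, 3) returned {result}, expected 5"


final_code, turns = multi_turn_generate(
    toy_generate, toy_score, toy_execute, "Write add(a, b)."
)
```

In this toy run the first turn fails on the public test, the failure message is appended to the history, and the second turn recovers the correct program in one step, which is the behaviour the one-step recoverability assumption relies on.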

Keywords

code generation

Cite

@article{arxiv.2502.20380,
  title  = {Multi-Turn Code Generation Through Single-Step Rewards},
  author = {Arnav Kumar Jain and Gonzalo Gonzalez-Pumariega and Wayne Chen and Alexander M Rush and Wenting Zhao and Sanjiban Choudhury},
  journal= {arXiv preprint arXiv:2502.20380},
  year   = {2025}
}

Comments

9 pages (not including references or appendix); 5 figures (in main paper); (v2) camera-ready version
