Differentiable Evolutionary Reinforcement Learning

Artificial Intelligence 2026-05-14 v2 Computation and Language

Abstract

Crafting effective reward signals remains a central challenge in Reinforcement Learning (RL), especially for complex reasoning tasks. Existing automated reward optimization methods typically rely on derivative-free search heuristics that treat the reward function as a black box, failing to exploit the causal dynamics between reward structure modifications and policy performance. We introduce Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework for the autonomous discovery of optimal reward structures. DERL employs a Meta-Optimizer that evolves a reward function through the composition of structured atomic primitives to guide an inner-loop policy. Unlike prior black-box methods, DERL introduces differentiability into the meta-optimization process by updating the Meta-Optimizer using policy gradients derived from inner-loop validation performance. This allows for the progressive learning of a "meta-gradient" for task success, providing the system with dense, actionable feedback. We validate DERL across diverse reasoning domains: embodied agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, substantially outperforming non-differentiable baselines, especially in out-of-distribution generalization. Trajectory analyses confirm that DERL captures the intrinsic causal structure of tasks, enabling fully autonomous, self-improving agent alignment.
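
The bi-level loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the Meta-Optimizer is parameterized, so the sketch assumes independent Bernoulli inclusion logits over three hypothetical primitives (`outcome`, `format`, `brevity`), replaces inner-loop RL training with a stand-in that picks the reward-maximizing behavior from a fixed list, and applies a plain REINFORCE update with a moving-average baseline to the outer loop. All names and constants below are illustrative assumptions.

```python
import math
import random

# Hypothetical atomic reward primitives; each scores one trajectory summary.
PRIMITIVES = {
    "outcome": lambda t: 1.0 if t["solved"] else 0.0,
    "format": lambda t: 1.0 if t["well_formed"] else 0.0,
    "brevity": lambda t: 1.0 / (1.0 + t["steps"]),
}
NAMES = list(PRIMITIVES)

# Toy behaviors the inner-loop "policy" can converge to (illustrative only).
BEHAVIORS = [
    {"solved": False, "well_formed": False, "steps": 0},  # gives up at once
    {"solved": False, "well_formed": True, "steps": 1},   # tidy but wrong
    {"solved": True, "well_formed": True, "steps": 8},    # solves the task
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compose_reward(mask):
    """Sum the primitives switched on by the boolean mask."""
    chosen = [PRIMITIVES[n] for n, on in zip(NAMES, mask) if on]
    return lambda traj: sum(p(traj) for p in chosen)


def inner_loop_validation(mask):
    """Stand-in for inner-loop RL: the policy adopts the reward-maximizing
    behavior; validation performance is whether that behavior solves the task."""
    reward = compose_reward(mask)
    best = max(BEHAVIORS, key=reward)  # ties resolve to the earliest behavior
    return 1.0 if best["solved"] else 0.0


def run_meta_optimization(steps=400, lr=0.5, seed=0):
    """REINFORCE over per-primitive inclusion logits, with a moving baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(NAMES)
    baseline = 0.0
    for _ in range(steps):
        probs = [sigmoid(x) for x in logits]
        mask = [rng.random() < p for p in probs]
        score = inner_loop_validation(mask)
        advantage = score - baseline
        # d/d(logit) of log Bernoulli(mask | p) is (mask - p).
        for i, (on, p) in enumerate(zip(mask, probs)):
            logits[i] += lr * advantage * ((1.0 if on else 0.0) - p)
        baseline = 0.9 * baseline + 0.1 * score
    return dict(zip(NAMES, (sigmoid(x) for x in logits)))
```

Calling `run_meta_optimization()` returns the learned inclusion probability for each primitive. In this toy setup, validation success depends only on whether `outcome` is included, so its probability should climb toward 1 while the other two primitives receive no consistent signal. A derivative-free search would have to discover the same fact by sampling and ranking whole reward configurations rather than by following a gradient estimate.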

Cite

@article{arxiv.2512.13399,
  title  = {Differentiable Evolutionary Reinforcement Learning},
  author = {Sitao Cheng and Tianle Li and Xuhan Huang and Xunjian Yin and Difan Zou},
  journal= {arXiv preprint arXiv:2512.13399},
  year   = {2026}
}

Comments

Work in Progress. We release our code and model at https://github.com/sitaocheng/DERL