Expected Sarsa($\lambda$) with Control Variate for Variance Reduction

Machine Learning 2019-09-09 v2 Artificial Intelligence Machine Learning

Abstract

Off-policy learning is a powerful tool for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge that can make off-policy learning unstable. In this paper, to reduce the variance, we introduce the control variate technique to Expected Sarsa($\lambda$) and propose a tabular ES($\lambda$)-CV algorithm. We prove that, given a proper estimator of the value function, the proposed ES($\lambda$)-CV enjoys a lower variance than Expected Sarsa($\lambda$). Furthermore, to extend ES($\lambda$)-CV to a convergent algorithm with linear function approximation, we propose the GES($\lambda$) algorithm under a convex-concave saddle-point formulation. We prove that the convergence rate of GES($\lambda$) is $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: GQ($\lambda$), GTB($\lambda$) and ABQ($\zeta$).
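
To make the control variate idea concrete, here is a minimal sketch in a one-step, two-action setting. It is not the ES($\lambda$)-CV algorithm from the paper. It only illustrates the general principle that subtracting a value estimate inside the importance-weighted term, and adding back its expectation under the target policy, keeps the estimator unbiased while reducing variance when the estimate is reasonable. All names (`mu`, `pi`, `rewards`, `q_hat`) and numbers are hypothetical, chosen for illustration.

```python
# Illustrative one-step setting; every number below is a made-up assumption.
mu = [0.5, 0.5]        # behavior policy over two actions
pi = [0.9, 0.1]        # target policy over the same actions
rewards = [1.0, 3.0]   # deterministic reward for each action
q_hat = [0.8, 2.5]     # imperfect action-value estimates (the control variate)


def exact_mean_var(values, probs):
    """Exact mean and variance of a discrete random variable."""
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


def is_estimates(pi, mu, rewards):
    """Plain importance-sampling estimate rho * r for each possible action."""
    return [pi[a] / mu[a] * rewards[a] for a in range(len(rewards))]


def cv_estimates(pi, mu, rewards, q_hat):
    """Control-variate estimate: rho * (r - q_hat[a]) + E_pi[q_hat]."""
    baseline = sum(p * q for p, q in zip(pi, q_hat))
    return [
        pi[a] / mu[a] * (rewards[a] - q_hat[a]) + baseline
        for a in range(len(rewards))
    ]


# Value of the target policy, computed directly for reference.
true_value = sum(p * r for p, r in zip(pi, rewards))

# Actions are sampled from mu, so mean and variance are taken under mu.
is_mean, is_var = exact_mean_var(is_estimates(pi, mu, rewards), mu)
cv_mean, cv_var = exact_mean_var(cv_estimates(pi, mu, rewards, q_hat), mu)

print(f"true value      : {true_value:.4f}")
print(f"IS  mean / var  : {is_mean:.4f} / {is_var:.4f}")
print(f"CV  mean / var  : {cv_mean:.4f} / {cv_var:.4f}")
```

In this toy setting both estimators have the same mean as the target-policy value, and the control-variate version has the smaller variance. The reduction depends on the quality of `q_hat`: a poor estimate can increase the variance instead, which is consistent with the abstract's requirement of a proper value-function estimator.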

Cite

@article{arxiv.1906.11058,
  title  = {Expected Sarsa($\lambda$) with Control Variate for Variance Reduction},
  author = {Long Yang and Yu Zhang and Jun Wen and Qian Zheng and Pengfei Li and Gang Pan},
  journal= {arXiv preprint arXiv:1906.11058},
  year   = {2019}
}