Expected Sarsa($\lambda$) with Control Variate for Variance Reduction
Abstract
Off-policy learning is a powerful tool for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge that can drive off-policy learning into uncontrolled instability. In this paper, to reduce this variance, we introduce the control variate technique to Expected Sarsa($\lambda$) and propose a tabular ()- algorithm. We prove that, given a proper estimator of the value function, the proposed ()- enjoys a lower variance than Expected Sarsa($\lambda$). Furthermore, to extend ()- to a convergent algorithm with linear function approximation, we propose the () algorithm under a convex-concave saddle-point formulation. We prove that the convergence rate of () achieves , which matches or outperforms many state-of-the-art gradient-based algorithms, while requiring a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: (), () and ().
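To illustrate the general idea of a control variate in an off-policy target, the following is a minimal sketch and not the algorithm proposed in the paper. It compares a plain importance-sampled target with a control-variate target at a single next state, computing the exact mean and variance by enumeration. The policies `pi` and `mu`, the values `q_true` and `q_hat`, and the noise model are all hypothetical choices made for this example.

```python
from itertools import product


def moments(estimator, mu, noise_levels):
    """Exact mean and variance of estimator(a, eps), where the action a is
    drawn from mu and eps is drawn uniformly from noise_levels."""
    outcomes = [
        (mu[a] / len(noise_levels), estimator(a, eps))
        for a, eps in product(range(len(mu)), noise_levels)
    ]
    mean = sum(p * x for p, x in outcomes)
    var = sum(p * (x - mean) ** 2 for p, x in outcomes)
    return mean, var


# Hypothetical two-action problem at one next state.
pi = [0.9, 0.1]        # target policy
mu = [0.5, 0.5]        # behaviour policy that generates the data
q_true = [1.0, 3.0]    # true action values (expected sampled return)
q_hat = [1.1, 2.8]     # imperfect learned estimate, used as control variate
noise = [-0.5, 0.5]    # sampled return = q_true[a] + eps

# Expected value of the estimate under the target policy.
v_bar = sum(p * q for p, q in zip(pi, q_hat))


def rho(a):
    """Importance sampling ratio for action a."""
    return pi[a] / mu[a]


def plain_is(a, eps):
    """Importance-sampled return with no control variate."""
    return rho(a) * (q_true[a] + eps)


def with_cv(a, eps):
    """Subtract the estimate inside the ratio and add back its expectation
    under the target policy, which leaves the mean unchanged."""
    return rho(a) * (q_true[a] + eps - q_hat[a]) + v_bar


is_mean, is_var = moments(plain_is, mu, noise)
cv_mean, cv_var = moments(with_cv, mu, noise)
print(f"plain IS:        mean={is_mean:.4f} var={is_var:.4f}")
print(f"control variate: mean={cv_mean:.4f} var={cv_var:.4f}")
```

In this toy setting both estimators have the same mean, while the control-variate version has the smaller variance because `q_hat` is close to `q_true`. A poor estimate would shrink or remove that advantage, which is consistent with the abstract's condition on having a proper estimator of the value function.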
Cite
@article{arxiv.1906.11058,
title = {Expected Sarsa($\lambda$) with Control Variate for Variance Reduction},
author = {Long Yang and Yu Zhang and Jun Wen and Qian Zheng and Pengfei Li and Gang Pan},
journal = {arXiv preprint arXiv:1906.11058},
year = {2019}
}