
Sample Efficient Policy Gradient Methods with Recursive Variance Reduction

Machine Learning 2021-08-03 v3 Optimization and Control Machine Learning

Abstract

Improving the sample efficiency in reinforcement learning has been a long-standing research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\boldsymbol{\theta})$ (i.e., $\boldsymbol{\theta}$ such that $\|\nabla J(\boldsymbol{\theta})\|_2^2\leq\epsilon$). This sample complexity improves the existing result $O(1/\epsilon^{5/3})$ for stochastic variance reduced policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we also propose a variant of SRVR-PG with parameter exploration, which explores the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.
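
The stated improvement factor can be checked by dividing the two bounds. This is only a sketch of the arithmetic: it ignores the constants hidden by the O-notation, and the numerical value of $\epsilon$ at the end is an illustrative assumption, not a figure from the paper.

```latex
\[
\frac{1/\epsilon^{5/3}}{1/\epsilon^{3/2}}
  = \epsilon^{\,3/2 - 5/3}
  = \epsilon^{-1/6}
  = \frac{1}{\epsilon^{1/6}},
\qquad\text{since}\qquad
\frac{5}{3} - \frac{3}{2} = \frac{10 - 9}{6} = \frac{1}{6}.
\]
% Illustrative value (assumed): for \epsilon = 10^{-6},
% the factor is (10^{-6})^{-1/6} = 10.
```

So the smaller the target accuracy $\epsilon$, the larger the gap between the two episode counts, although it grows slowly because the exponent is only $1/6$.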

Cite

@article{arxiv.1909.08610,
  title  = {Sample Efficient Policy Gradient Methods with Recursive Variance Reduction},
  author = {Pan Xu and Felicia Gao and Quanquan Gu},
  journal= {arXiv preprint arXiv:1909.08610},
  year   = {2021}
}

Comments

23 pages, 2 figures, 3 tables. In ICLR 2020
