Provably Efficient Exploration in Policy Optimization

Qi Cai; Zhuoran Yang; Chi Jin; Zhaoran Wang

Machine Learning; 2024-04-02 (v4); Optimization and Control; Machine Learning

Abstract

Although policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory than value-based RL. In particular, it remains unclear how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. The paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
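
To make the idea of an "optimistic" policy step more concrete, the following is a minimal sketch of one way such an update could look at a single state, not the paper's algorithm as specified. It assumes a finite action set, a strictly positive current policy, and an exploration bonus that is already given; the function name `optimistic_policy_update`, the step size `alpha`, and the `bonus` input are all hypothetical names introduced for illustration. The sketch inflates the value estimates by the bonus and then applies an exponentiated (KL-regularized) update.

```python
import numpy as np

def optimistic_policy_update(pi_old, q_estimate, bonus, alpha, horizon):
    """One exponentiated policy step at a single state (illustrative only).

    pi_old     : (A,) current action probabilities, assumed strictly positive
    q_estimate : (A,) estimated action values
    bonus      : (A,) nonnegative exploration bonus (hypothetical input)
    alpha      : step size of the update (hypothetical parameter)
    horizon    : upper bound used to truncate the optimistic values
    """
    # Optimism: add the bonus to the estimates, then truncate to [0, horizon].
    q_optimistic = np.clip(q_estimate + bonus, 0.0, horizon)
    # Exponentiated update computed in log space: pi_new ∝ pi_old * exp(alpha * Q).
    logits = np.log(pi_old) + alpha * q_optimistic
    logits -= logits.max()  # subtract the max for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Usage: a uniform policy over three actions, where only the third
# action carries a nonzero exploration bonus.
pi = optimistic_policy_update(
    pi_old=np.ones(3) / 3,
    q_estimate=np.array([1.0, 1.0, 1.0]),
    bonus=np.array([0.0, 0.0, 2.0]),
    alpha=0.5,
    horizon=5.0,
)
```

In this sketch, actions with a larger bonus (for example, rarely tried actions) receive more probability mass after the step, which is one way exploration can enter a policy-based update.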

Keywords

policy gradient; hyperparameter optimization; Markov decision processes

Cite

@article{arxiv.1912.05830,
  title  = {Provably Efficient Exploration in Policy Optimization},
  author = {Qi Cai and Zhuoran Yang and Chi Jin and Zhaoran Wang},
  journal= {arXiv preprint arXiv:1912.05830},
  year   = {2024}
}

Comments

We have fixed a technical issue in the first version of this paper. We remark that the technical assumption of the linear MDP in this version differs from that in the first version.
