Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

Machine Learning, 2023-01-31, v1

Abstract

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full-information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror descent and least-squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde{O}(K^{6/7})$ regret bound, improving significantly over the previous state of the art of $\widetilde{O}(K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde{O}(K^{2/3})$.
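
To give a feel for what the three exponents in the abstract mean, the following minimal sketch compares the per-episode rate K^(alpha - 1) implied by a regret bound of order K^alpha. It is an illustration only: constants, logarithmic factors, and all problem-dependent terms (dimension, horizon) are ignored, and the function names are hypothetical, not taken from the paper.

```python
# Illustrative comparison of the regret exponents quoted in the abstract.
# Assumption: regret is treated as exactly K**alpha, dropping constants,
# log factors, and any dependence on dimension or horizon.

EXPONENTS = {
    "previous, bandit feedback": 14 / 15,
    "this work, bandit feedback": 6 / 7,
    "this work, with simulator": 2 / 3,
}


def average_regret_rate(num_episodes: int, exponent: float) -> float:
    """Per-episode rate K**(exponent - 1) implied by a K**exponent regret bound."""
    return num_episodes ** (exponent - 1.0)


def episodes_for_target(target: float, exponent: float) -> float:
    """Value of K at which K**(exponent - 1) equals `target` (constants ignored)."""
    # exponent - 1 is negative, so a smaller target requires a larger K.
    return target ** (1.0 / (exponent - 1.0))


if __name__ == "__main__":
    for name, alpha in EXPONENTS.items():
        rate = average_regret_rate(10**6, alpha)
        needed = episodes_for_target(0.1, alpha)
        print(f"{name}: rate at K=1e6 is {rate:.4f}, K for rate 0.1 is {needed:.3g}")
```

Under these simplifications, driving the per-episode rate down to 0.1 takes on the order of 10^15 episodes with exponent 14/15, 10^7 with exponent 6/7, and 10^3 with exponent 2/3, which is one way to read the improvement claimed above.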

Cite

@article{arxiv.2301.13087,
  title  = {Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation},
  author = {Uri Sherman and Tomer Koren and Yishay Mansour},
  journal= {arXiv preprint arXiv:2301.13087},
  year   = {2023}
}