Policy Gradient for Rectangular Robust Markov Decision Processes

Machine Learning 2023-12-12 v2 Artificial Intelligence

Abstract

Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal kernel. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.
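To make the phrase "rank-one perturbation of the nominal" concrete, the following is a minimal NumPy sketch of what such a perturbation looks like for a small tabular transition matrix. It is illustrative only: the function name `rank_one_perturbation`, the scale vector `b`, and the direction vector `k` are hypothetical choices made for this example, not the closed-form quantities derived in the paper. The sketch only shows the structural point that subtracting an outer product with a zero-sum direction changes the kernel by a rank-one matrix while keeping every row a probability distribution (provided the perturbation is small enough to keep entries nonnegative).

```python
import numpy as np

def rank_one_perturbation(P, b, k):
    """Return P - outer(b, k) for a row-stochastic matrix P.

    Assumption (not from the paper): b is an arbitrary per-state scale and
    k an arbitrary direction over next states. k is centered so that it
    sums to zero, which leaves every row sum of P unchanged.
    """
    k = k - k.mean()
    return P - np.outer(b, k)

# Hypothetical nominal kernel over 4 states, with entries bounded away from 0.
rng = np.random.default_rng(0)
n_states = 4
P_nominal = rng.random((n_states, n_states)) + 1.0
P_nominal /= P_nominal.sum(axis=1, keepdims=True)

# Hypothetical perturbation vectors, chosen small enough to keep entries >= 0.
b = np.full(n_states, 0.01)
k = np.array([1.0, -0.5, 0.25, -0.75])

P_perturbed = rank_one_perturbation(P_nominal, b, k)

# The difference between the two kernels is a rank-one matrix.
delta = P_nominal - P_perturbed
delta_rank = np.linalg.matrix_rank(delta, tol=1e-10)
row_sums = P_perturbed.sum(axis=1)
```

Here `delta_rank` measures the rank of the difference between the nominal and perturbed kernels, and `row_sums` checks that the perturbed matrix is still row-stochastic. Which `b` and `k` correspond to the worst kernel for a given uncertainty set is exactly what the paper's closed-form result specifies, and is not reproduced here.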

Cite

@article{arxiv.2301.13589,
  title  = {Policy Gradient for Rectangular Robust Markov Decision Processes},
  author = {Navdeep Kumar and Esther Derman and Matthieu Geist and Kfir Levy and Shie Mannor},
  journal= {arXiv preprint arXiv:2301.13589},
  year   = {2023}
}

Comments

Accepted to NeurIPS 2023