Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Machine Learning; Statistics Theory | 2020-02-25 | v1

Abstract

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging the contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
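
The equivalence between regression-based fitted Q iteration and a model-based estimate can be illustrated with a minimal sketch. This is not the authors' code. It assumes a time-homogeneous, undiscounted, finite-horizon setting, one-hot features on a toy MDP, and a small ridge term for numerical stability; all function and variable names (`fqe_weights`, `model_based_weights`, `phi_next_pi`, and so on) are hypothetical.

```python
import numpy as np

def fqe_weights(phi, phi_next_pi, rewards, horizon, ridge=1e-6):
    """Fitted Q evaluation: repeatedly regress r + Q(s', pi) onto phi(s, a)."""
    d = phi.shape[1]
    gram = phi.T @ phi + ridge * np.eye(d)
    w = np.zeros(d)
    for _ in range(horizon):
        targets = rewards + phi_next_pi @ w  # one-step Bellman backup targets
        w = np.linalg.solve(gram, phi.T @ targets)
    return w

def model_based_weights(phi, phi_next_pi, rewards, horizon, ridge=1e-6):
    """Estimate a d x d feature-transition matrix and reward vector, then plan."""
    d = phi.shape[1]
    gram = phi.T @ phi + ridge * np.eye(d)
    m_hat = np.linalg.solve(gram, phi.T @ phi_next_pi)  # estimated transition embedding
    r_hat = np.linalg.solve(gram, phi.T @ rewards)      # estimated reward weights
    w = np.zeros(d)
    for _ in range(horizon):
        w = r_hat + m_hat @ w  # Bellman recursion in the estimated model
    return w

# Toy batch data (assumed setup): 3 states, 2 actions, uniform behavior policy.
rng = np.random.default_rng(0)
n_states, n_actions, n, horizon = 3, 2, 500, 5
d = n_states * n_actions

def one_hot(s, a):
    v = np.zeros(d)
    v[s * n_actions + a] = 1.0
    return v

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
R = rng.uniform(size=(n_states, n_actions))
target_pi = np.full((n_states, n_actions), 1.0 / n_actions)  # target policy (uniform here)

s = rng.integers(n_states, size=n)
a = rng.integers(n_actions, size=n)
s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])

phi = np.array([one_hot(si, ai) for si, ai in zip(s, a)])
# Expected next-step feature under the target policy: sum_a' pi(a'|s') phi(s', a').
phi_next_pi = np.array([
    sum(target_pi[sn, b] * one_hot(sn, b) for b in range(n_actions))
    for sn in s_next
])
rewards = R[s, a]

w_fqe = fqe_weights(phi, phi_next_pi, rewards, horizon)
w_mb = model_based_weights(phi, phi_next_pi, rewards, horizon)

# Estimated value of the target policy from initial state 0.
v0 = sum(target_pi[0, b] * one_hot(0, b) @ w_fqe for b in range(n_actions))
print(np.allclose(w_fqe, w_mb), float(v0))
```

In this sketch the two routines agree because each regression step satisfies `solve(gram, phi.T @ (rewards + phi_next_pi @ w)) == r_hat + m_hat @ w`, so the fitted Q iterates coincide with planning in the estimated linear model.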

Cite

@article{arxiv.2002.09516,
  title  = {Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation},
  author = {Yaqi Duan and Mengdi Wang},
  journal = {arXiv preprint arXiv:2002.09516},
  year   = {2020}
}