Sequential Decision Problems with Missing Feedback

Econometrics 2025-07-29 v1

Abstract

This paper investigates the challenges of optimal online policy learning under missing data. State-of-the-art algorithms implicitly assume that rewards are always observable. I show that when rewards are missing at random, the Upper Confidence Bound (UCB) algorithm maintains optimal regret bounds; however, it selects suboptimal policies with high probability as soon as this assumption is relaxed. To overcome this limitation, I introduce a fully nonparametric algorithm, Doubly-Robust Upper Confidence Bound (DR-UCB), which explicitly models the form of missingness through observable covariates and achieves a nearly optimal worst-case regret rate of $\widetilde{O}(\sqrt{T})$. To prove this result, I derive high-probability bounds for a class of doubly-robust estimators that hold under broad dependence structures. Simulation results closely match the theoretical predictions, validating the proposed framework.
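
To make the doubly-robust idea in the abstract concrete, the following is a minimal sketch and not the paper's DR-UCB algorithm. It assumes the textbook augmented inverse-probability-weighted (AIPW) form of a doubly-robust mean estimate and the standard UCB1 exploration bonus. The function names `dr_mean` and `ucb_index`, the `scale` parameter, and the inputs `propensities` and `outcome_preds` are hypothetical: the sketch takes the estimated probability of observing a reward and the estimated conditional mean reward as given, whereas the paper estimates the missingness mechanism from covariates.

```python
import math


def dr_mean(rewards, observed, propensities, outcome_preds):
    """Textbook AIPW (doubly-robust) estimate of a mean reward with missing rewards.

    rewards[i] is ignored whenever observed[i] is False.
    propensities[i]: assumed estimate of P(reward i observed | covariates).
    outcome_preds[i]: assumed estimate of E[reward i | covariates].
    """
    total = 0.0
    for r, seen, p, m in zip(rewards, observed, propensities, outcome_preds):
        # Start from the outcome-model prediction and, when the reward is
        # observed, add the inverse-probability-weighted residual.
        correction = (r - m) / p if seen else 0.0
        total += m + correction
    return total / len(rewards)


def ucb_index(estimate, n_pulls, t, scale=1.0):
    """Standard UCB1-style index: point estimate plus an exploration bonus."""
    return estimate + scale * math.sqrt(2.0 * math.log(t) / n_pulls)


# Illustrative inputs (made up for this sketch): two of four rewards are missing.
rewards = [1.0, None, 0.0, None]
observed = [True, False, True, False]
propensities = [0.5, 0.5, 0.5, 0.5]
outcome_preds = [0.5, 0.5, 0.5, 0.5]

estimate = dr_mean(rewards, observed, propensities, outcome_preds)
index = ucb_index(estimate, n_pulls=len(rewards), t=10)
```

The estimate stays consistent if either the propensity model or the outcome model is correct, which is the sense of "doubly robust" here. The bonus actually needed for the regret guarantee depends on the paper's high-probability bounds, which this sketch does not reproduce.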

Cite

@article{arxiv.2507.19596,
  title  = {Sequential Decision Problems with Missing Feedback},
  author = {Filippo Palomba},
  journal= {arXiv preprint arXiv:2507.19596},
  year   = {2025}
}