Related papers: Off-policy Confidence Sequences

PAC Off-Policy Prediction of Contextual Bandits

This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal…

Machine Learning · Statistics 2025-07-23 Yilong Wan, Yuqiang Li, Xianyi Wu

Anytime-valid off-policy inference for contextual bandits

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to…

Methodology · Statistics 2024-08-19 Ian Waudby-Smith, Lili Wu, Aaditya Ramdas, Nikos Karampatziakis, Paul Mineiro

PAC-Bayesian Offline Contextual Bandits With Guarantees

This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the…

Machine Learning · Statistics 2023-05-30 Otmane Sakhi, Pierre Alquier, Nicolas Chopin

Empirical Likelihood for Contextual Bandits

We propose an estimator and confidence interval for computing the value of a policy from off-policy data in the contextual bandit setting. To this end we apply empirical likelihood techniques to formulate our estimator and confidence…

Machine Learning · Computer Science 2020-10-20 Nikos Karampatziakis, John Langford, Paul Mineiro

Conformal Off-Policy Prediction in Contextual Bandits

Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may…

Machine Learning · Statistics 2022-10-27 Muhammad Faaiz Taufiq, Jean-Francois Ton, Rob Cornish, Yee Whye Teh, Arnaud Doucet

Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a…

Machine Learning · Computer Science 2025-07-15 J. Jon Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun

Cramming Contextual Bandits for On-policy Statistical Evaluation

We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy…

Machine Learning · Computer Science 2025-04-16 Zeyang Jia, Kosuke Imai, Michael Lingzhi Li

Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. However, because the contextual bandit algorithm updates the policy based on past observations, the samples are not…

Machine Learning · Computer Science 2020-10-27 Masahiro Kato, Yusuke Kaneko

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

This study addresses the problem of off-policy evaluation (OPE) from dependent samples obtained via the bandit algorithm. The goal of OPE is to evaluate a new policy using historical data obtained from behavior policies generated by the…

Machine Learning · Statistics 2020-06-15 Masahiro Kato

On Learning to Rank Long Sequences with Contextual Bandits

Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible length sequences with varying rewards and losses. We formulate two generative models for this…

Machine Learning · Computer Science 2022-09-05 Anirban Santara, Claudio Gentile, Gaurav Aggarwal, Shuai Li

Off-policy Bandits with Deficient Support

Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g. voice assistants, recommendation, search), since it enables the reuse of large amounts of log data.…

Machine Learning · Computer Science 2020-06-18 Noveen Sachdeva, Yi Su, Thorsten Joachims

Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting

We consider off-policy evaluation in the contextual bandit setting for the purpose of obtaining a robust off-policy selection strategy, where the selection strategy is evaluated based on the value of the chosen policy in a set of proposal…

Machine Learning · Computer Science 2022-03-22 Ilja Kuzborskij, Claire Vernade, András György, Csaba Szepesvári

A Contextual Bandit Bake-off

Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes on statistically and computationally efficient methods, the practical behavior of these…

Machine Learning · Statistics 2021-06-08 Alberto Bietti, Alekh Agarwal, John Langford

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

We study the off-policy evaluation problem (estimating the value of a target policy using data collected by another policy) under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model…

Machine Learning · Statistics 2017-11-15 Yu-Xiang Wang, Alekh Agarwal, Miroslav Dudik

Non-Stationary Off-Policy Optimization

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to…

Machine Learning · Computer Science 2021-04-06 Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

Online Learning with Off-Policy Feedback

We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead…

Machine Learning · Computer Science 2022-07-20 Germano Gabbianelli, Matteo Papini, Gergely Neu

Bandits with Partially Observable Confounded Data

We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We…

Machine Learning · Computer Science 2021-08-11 Guy Tennenholtz, Uri Shalit, Shie Mannor, Yonathan Efroni

A Practical Guide of Off-Policy Evaluation for Bandit Problems

Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods for bandit problems has garnered attention. For the theoretical guarantees of…

Machine Learning · Computer Science 2020-10-26 Masahiro Kato, Kenshi Abe, Kaito Ariu, Shota Yasui

Contextual Bandits for Evaluating and Improving Inventory Control Policies

Solutions to address the periodic review inventory control problem with nonstationary random demand, lost sales, and stochastic vendor lead times typically involve making strong assumptions on the dynamics for either approximation or…

Machine Learning · Statistics 2023-10-26 Dean Foster, Randy Jia, Dhruv Madeka

On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax…

Artificial Intelligence · Computer Science 2014-09-15 Lihong Li, Remi Munos, Csaba Szepesvari