Related papers: Off-policy Learning for Multiple Loggers

Off-Policy Evaluation from Logged Human Feedback

Learning from human feedback has been central to recent advances in artificial intelligence and machine learning. Since the collection of human feedback is costly, a natural question to ask is if the new feedback always needs to be collected.…

Machine Learning · Computer Science 2024-06-17 Aniruddha Bhargava, Lalit Jain, Branislav Kveton, Ge Liu, Subhojyoti Mukherjee

Pessimistic Off-Policy Optimization for Learning to Rank

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are…

Machine Learning · Computer Science 2024-10-23 Matej Cief, Branislav Kveton, Michal Kompan

Semi-supervised Batch Learning From Logged Data

Off-policy learning methods are intended to learn a policy from logged data, which includes context, action, and feedback (cost or reward) for each sample point. In this work, we build on the counterfactual risk minimization framework,…

Machine Learning · Computer Science 2024-02-20 Gholamali Aminian, Armin Behnamnia, Roberto Vega, Laura Toni, Chengchun Shi, Hamid R. Rabiee, Omar Rivasplata, Miguel R. D. Rodrigues

Offline Multi-Action Policy Learning: Generalization and Optimization

In many settings, a decision-maker wishes to learn a rule, or policy, that maps from observable characteristics of an individual to an action. Examples include selecting offers, prices, advertisements, or emails to send to consumers, as…

Machine Learning · Statistics 2018-11-20 Zhengyuan Zhou, Susan Athey, Stefan Wager

Off-Policy Optimization of Portfolio Allocation Policies under Constraints

The dynamic portfolio optimization problem in finance frequently requires learning policies that adhere to various constraints, driven by investor preferences and risk. We motivate this problem of finding an allocation policy within a…

Artificial Intelligence · Computer Science 2020-12-23 Nymisha Bandi, Theja Tulabandhula

Reliable Off-policy Evaluation for Reinforcement Learning

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy.…

Machine Learning · Computer Science 2022-11-04 Jie Wang, Rui Gao, Hongyuan Zha

Effective Evaluation using Logged Bandit Feedback from Multiple Loggers

Accurately evaluating new policies (e.g. ad-placement models, ranking functions, recommendation functions) is one of the key prerequisites for improving interactive systems. While the conventional approach to evaluation relies on online A/B…

Machine Learning · Computer Science 2017-06-27 Aman Agarwal, Soumya Basu, Tobias Schnabel, Thorsten Joachims

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we…

Machine Learning · Computer Science 2026-01-06 Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important…

Machine Learning · Computer Science 2016-04-05 Philip S. Thomas, Emma Brunskill

Non-Stationary Off-Policy Optimization

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to…

Machine Learning · Computer Science 2021-04-06 Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

Large scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This…

Machine Learning · Computer Science 2026-05-28 Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile

Hyperparameter Optimization Can Even be Harmful in Off-Policy Learning and How to Deal with It

There has been a growing interest in off-policy evaluation in the literature such as recommender systems and personalized medicine. We have so far seen significant progress in developing estimators aimed at accurately estimating the…

Machine Learning · Computer Science 2024-04-24 Yuta Saito, Masahiro Nomura

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or…

Machine Learning · Computer Science 2022-02-03 Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving,…

Machine Learning · Computer Science 2018-11-08 Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

On Minimax Optimal Offline Policy Evaluation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax…

Artificial Intelligence · Computer Science 2014-09-15 Lihong Li, Remi Munos, Csaba Szepesvari

Boosted Off-Policy Learning

We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze…

Machine Learning · Computer Science 2023-05-03 Ben London, Levi Lu, Ted Sandler, Thorsten Joachims

Large-scale Validation of Counterfactual Learning Methods: A Test-Bed

The ability to perform effective off-policy learning would revolutionize the process of building better interactive systems, such as search engines and recommendation systems for e-commerce, computational advertising and news. Recent…

Machine Learning · Computer Science 2017-06-27 Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, Maarten de Rijke

Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach

Compared to on-policy counterparts, off-policy model-free deep reinforcement learning can improve data efficiency by repeatedly using the previously gathered data. However, off-policy learning becomes challenging when the discrepancy…

Machine Learning · Computer Science 2023-09-27 Baturay Saglam, Dogan C. Cicek, Furkan B. Mutlu, Suleyman S. Kozat

Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is…

Machine Learning · Computer Science 2021-12-24 Brett Daley, Christopher Amato

More Efficient Off-Policy Evaluation through Regularized Targeted Learning

We study the problem of off-policy evaluation (OPE) in Reinforcement Learning (RL), where the aim is to estimate the performance of a new policy given historical data that may have been generated by a different policy, or policies. In…

Machine Learning · Computer Science 2019-12-16 Aurélien F. Bibaut, Ivana Malenica, Nikos Vlassis, Mark J. van der Laan