Related papers: Off-policy evaluation for MDPs with unknown struct…

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or…

Machine Learning · Computer Science 2022-02-03 Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

Generalized Proximal Policy Optimization with Sample Reuse

In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while…

Machine Learning · Computer Science 2021-11-02 James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important…

Machine Learning · Computer Science 2016-04-05 Philip S. Thomas, Emma Brunskill

Off-Policy Evaluation with Out-of-Sample Guarantees

We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss (a.k.a. disutility or negative reward) and the main problem is making valid…

Machine Learning · Statistics 2023-07-03 Sofia Ek, Dave Zachariah, Fredrik D. Johansson, Petre Stoica

Multi-step Off-policy Learning Without Importance Sampling Ratios

To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus…

Machine Learning · Computer Science 2017-02-13 Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

Off-policy Learning with Eligibility Traces: A Survey

In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly…

Artificial Intelligence · Computer Science 2013-04-16 Matthieu Geist, Bruno Scherrer

Off-Policy Evaluation via Off-Policy Classification

In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment.…

Machine Learning · Computer Science 2019-11-26 Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine

Off-Policy Evaluation for Action-Dependent Non-Stationary Environments

Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes…

Machine Learning · Computer Science 2023-01-26 Yash Chandak, Shiv Shankar, Nathaniel D. Bastian, Bruno Castro da Silva, Emma Brunskill, Philip S. Thomas

Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy. However, because the contextual bandit algorithm updates the policy based on past observations, the samples are not…

Machine Learning · Computer Science 2020-10-27 Masahiro Kato, Yusuke Kaneko

Distillation Policy Optimization

While on-policy algorithms are known for their stability, they often demand a substantial number of samples. In contrast, off-policy algorithms, which leverage past experiences, are considered sample-efficient but tend to exhibit…

Machine Learning · Computer Science 2023-09-28 Jianfei Ma

Pessimistic Off-Policy Optimization for Learning to Rank

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are…

Machine Learning · Computer Science 2024-10-23 Matej Cief, Branislav Kveton, Michal Kompan

Scaling Life-long Off-policy Learning

We pursue a life-long learning approach to artificial intelligence that makes extensive use of reinforcement learning algorithms. We build on our prior work with general value functions (GVFs) and the Horde architecture. GVFs have been…

Artificial Intelligence · Computer Science 2012-06-28 Adam White, Joseph Modayil, Richard S. Sutton

Fair Off-Policy Learning from Observational Data

Algorithmic decision-making in practice must be fair for legal, ethical, and societal reasons. To achieve this, prior research has contributed various approaches that ensure fairness in machine learning predictions, while comparatively…

Machine Learning · Computer Science 2023-10-10 Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use…

Machine Learning · Statistics 2024-02-20 Davide Mambelli, Stephan Bongers, Onno Zoeter, Matthijs T. J. Spaan, Frans A. Oliehoek

Non-Stationary Off-Policy Optimization

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to…

Machine Learning · Computer Science 2021-04-06 Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

Supervised Off-Policy Ranking

Off-policy evaluation (OPE) aims to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end…

Machine Learning · Computer Science 2022-06-22 Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li, Tie-Yan Liu

Benchmarks for Deep Off-Policy Evaluation

Off-policy evaluation (OPE) holds the promise of being able to leverage large, offline datasets for both evaluating and selecting complex policies for decision making. The ability to learn offline is particularly important in many…

Machine Learning · Computer Science 2021-04-01 Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Tom Le Paine

Off-Policy Evaluation in Partially Observable Environments

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large…

Machine Learning · Computer Science 2019-11-26 Guy Tennenholtz, Shie Mannor, Uri Shalit

Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for…

Artificial Intelligence · Computer Science 2017-12-07 Zhaohan Daniel Guo, Philip S. Thomas, Emma Brunskill

Reliable Off-policy Evaluation for Reinforcement Learning

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy.…

Machine Learning · Computer Science 2022-11-04 Jie Wang, Rui Gao, Hongyuan Zha