Related papers: Off-Belief Learning

Zero-Shot Off-Policy Learning

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function…

Machine Learning · Computer Science 2026-02-03 Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or…

Machine Learning · Computer Science 2022-02-03 Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

Pessimistic Off-Policy Optimization for Learning to Rank

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are…

Machine Learning · Computer Science 2024-10-23 Matej Cief, Branislav Kveton, Michal Kompan

Offline Meta Learning of Exploration

Consider the following instance of the Offline Meta Reinforcement Learning (OMRL) problem: given the complete training logs of $N$ conventional RL agents, trained on $N$ different tasks, design a meta-agent that can quickly maximize reward…

Machine Learning · Computer Science 2021-02-15 Ron Dorfman, Idan Shenfeld, Aviv Tamar

Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important due to that they can benefit from off-policy…

Machine Learning · Computer Science 2024-05-07 Wenjia Meng, Qian Zheng, Long Yang, Yilong Yin, Gang Pan

Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief

Model-based offline reinforcement learning (RL) aims to find highly rewarding policy, by leveraging a previously collected static dataset and a dynamics model. While the dynamics model learned through reuse of the static dataset, its…

Machine Learning · Computer Science 2022-11-01 Kaiyang Guo, Yunfeng Shao, Yanhui Geng

Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Motivated by the human-machine interaction such as training chatbots for improving customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based…

Machine Learning · Statistics 2022-12-26 Zuyue Fu, Zhengling Qi, Zhuoran Yang, Zhaoran Wang, Lan Wang

Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction

Improving the sample efficiency of reinforcement learning algorithms requires effective exploration. Following the principle of $\textit{optimism in the face of uncertainty}$ (OFU), we train a separate exploration policy to maximize the…

Machine Learning · Computer Science 2022-11-23 Jiachen Li, Shuo Cheng, Zhenyu Liao, Huayan Wang, William Yang Wang, Qinxun Bai

Evaluation-Time Policy Switching for Offline Reinforcement Learning

Offline reinforcement learning (RL) looks at learning how to optimally solve tasks using a fixed dataset of interactions from the environment. Many off-policy algorithms developed for online learning struggle in the offline setting as they…

Machine Learning · Computer Science 2025-03-18 Natinael Solomon Neggatu, Jeremie Houssineau, Giovanni Montana

Optimal Policy Learning with Observational Data in Multi-Action Scenarios: Estimation, Risk Preference, and Potential Failures

This paper deals with optimal policy learning (OPL) with observational data, i.e. data-driven optimal decision-making, in multi-action (or multi-arm) settings, where a finite set of decision options is available. It is organized in three…

Machine Learning · Statistics 2024-04-01 Giovanni Cerulli

Non-Stationary Off-Policy Optimization

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to…

Machine Learning · Computer Science 2021-04-06 Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

Large scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This…

Machine Learning · Computer Science 2026-05-28 Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile

Off-Policy Evaluation in Partially Observable Environments

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large…

Machine Learning · Computer Science 2019-11-26 Guy Tennenholtz, Shie Mannor, Uri Shalit

LLMs Can Learn to Reason Via Off-Policy RL

Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference…

Machine Learning · Computer Science 2026-03-03 Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun

Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that…

Machine Learning · Statistics 2025-09-04 Imad Aouali, Otmane Sakhi

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving,…

Machine Learning · Computer Science 2018-11-08 Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition

We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces where existing methods -- most of which rely crucially on reward-regression models or importance-weighted policy gradients -- fail due to…

Machine Learning · Statistics 2024-02-12 Yuta Saito, Jihan Yao, Thorsten Joachims

A General Framework for Off-Policy Learning with Partially-Observed Reward

Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards…

Machine Learning · Computer Science 2025-06-18 Rikiya Takehi, Masahiro Asami, Kosuke Kawakami, Yuta Saito

Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning

A key problem in off-policy Reinforcement Learning (RL) is the mismatch, or distribution shift, between the dataset and the distribution over states and actions visited by the learned policy. This problem is exacerbated in the fully offline…

Machine Learning · Computer Science 2023-11-28 Melrose Roderick, Gaurav Manek, Felix Berkenkamp, J. Zico Kolter

Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall…

Machine Learning · Computer Science 2025-06-06 Ying Jin, Zhimei Ren, Zhuoran Yang, Zhaoran Wang