Related papers: Minimax Model Learning

Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction

Improving the sample efficiency of reinforcement learning algorithms requires effective exploration. Following the principle of $\textit{optimism in the face of uncertainty}$ (OFU), we train a separate exploration policy to maximize the…

Machine Learning · Computer Science 2022-11-23 Jiachen Li , Shuo Cheng , Zhenyu Liao , Huayan Wang , William Yang Wang , Qinxun Bai

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history…

Machine Learning · Computer Science 2020-02-25 Yaqi Duan , Mengdi Wang

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

Large scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This…

Machine Learning · Computer Science 2026-05-28 Otmane Sakhi , Aleksei Arzhantsev , Imad Aouali , Flavian Vasile

Zero-Shot Off-Policy Learning

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function…

Machine Learning · Computer Science 2026-02-03 Arip Asadulaev , Maksim Bobrin , Salem Lahlou , Dmitry Dylov , Fakhri Karray , Martin Takac

Model-based Offline Reinforcement Learning with Local Misspecification

We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch and we propose an empirical algorithm for optimal offline policy…

Machine Learning · Computer Science 2023-01-30 Kefan Dong , Yannis Flet-Berliac , Allen Nie , Emma Brunskill

Off-Policy Imitation Learning from Observations

Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit through the reuse of incomplete resources. Compared to conventional imitation learning (IL), LfO is more challenging…

Machine Learning · Computer Science 2021-03-01 Zhuangdi Zhu , Kaixiang Lin , Bo Dai , Jiayu Zhou

Evaluation-Time Policy Switching for Offline Reinforcement Learning

Offline reinforcement learning (RL) looks at learning how to optimally solve tasks using a fixed dataset of interactions from the environment. Many off-policy algorithms developed for online learning struggle in the offline setting as they…

Machine Learning · Computer Science 2025-03-18 Natinael Solomon Neggatu , Jeremie Houssineau , Giovanni Montana

Semi-Parametric Efficient Policy Learning with Continuous Actions

We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data where the data collection policy is unknown and needs to be estimated. We take a semi-parametric approach where the value…

Econometrics · Economics 2019-07-23 Mert Demirer , Vasilis Syrgkanis , Greg Lewis , Victor Chernozhukov

Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error

Training agents via off-policy deep reinforcement learning (RL) requires a large memory, named replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches…

Machine Learning · Computer Science 2022-12-27 Bumgeun Park , Taeyoung Kim , Woohyeon Moon , Luiz Felipe Vecchietti , Dongsoo Har

Reliable Off-policy Evaluation for Reinforcement Learning

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy.…

Machine Learning · Computer Science 2022-11-04 Jie Wang , Rui Gao , Hongyuan Zha

Bi-Level Offline Policy Optimization with Limited Exploration

We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. A fundamental challenge behind this task is the distributional shift due to the dataset lacking sufficient exploration,…

Machine Learning · Computer Science 2023-10-11 Wenzhuo Zhou

MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning

We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions…

Machine Learning · Computer Science 2026-03-19 Yihong Guo , Yu Yang , Pan Xu , Anqi Liu

RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast,…

Machine Learning · Computer Science 2026-02-12 Linxuan Xia , Xiaolong Yang , Yongyuan Chen , Enyue Zhao , Deng Cai , Yasheng Wang , Boxi Wu

Minimax Weight and Q-Function Learning for Off-Policy Evaluation

We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include: (1) A new estimator, MWL, that…

Machine Learning · Computer Science 2020-10-08 Masatoshi Uehara , Jiawei Huang , Nan Jiang

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or…

Machine Learning · Computer Science 2022-02-03 Simon Schmitt , John Shawe-Taylor , Hado van Hasselt

Bidirectional Model-based Policy Optimization

Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although there are several existing methods…

Machine Learning · Computer Science 2020-09-30 Hang Lai , Jian Shen , Weinan Zhang , Yong Yu

Pessimistic Off-Policy Optimization for Learning to Rank

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are…

Machine Learning · Computer Science 2024-10-23 Matej Cief , Branislav Kveton , Michal Kompan

Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as \textit{extrapolation error}. This is often due…

Machine Learning · Computer Science 2019-12-03 Riashat Islam , Komal K. Teru , Deepak Sharma , Joelle Pineau

LLMs Can Learn to Reason Via Off-Policy RL

Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference…

Machine Learning · Computer Science 2026-03-03 Daniel Ritter , Owen Oertell , Bradley Guo , Jonathan Chang , Kianté Brantley , Wen Sun

Combining Parametric and Nonparametric Models for Off-Policy Evaluation

We consider a model-based approach to perform batch off-policy evaluation in reinforcement learning. Our method takes a mixture-of-experts approach to combine parametric and non-parametric models of the environment such that the final value…

Machine Learning · Computer Science 2020-02-19 Omer Gottesman , Yao Liu , Scott Sussex , Emma Brunskill , Finale Doshi-Velez