Related papers: Offline Policy Optimization with Eligible Actions

Importance-Weighted Offline Learning Done Right

We study the problem of offline policy optimization in stochastic contextual bandit problems, where the goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy. Rather than making…

Machine Learning · Computer Science 2023-09-28 Germano Gabbianelli, Gergely Neu, Matteo Papini

Offline Policy Learning with Weight Clipping and Heaviside Composite Optimization

Offline policy learning aims to use historical data to learn an optimal personalized decision rule. In the standard estimate-then-optimize framework, reweighting-based methods (e.g., inverse propensity weighting or doubly robust estimators)…

Optimization and Control · Mathematics 2026-01-21 Jingren Liu, Hanzhang Qin, Junyi Liu, Mabel C. Chou, Jong-Shi Pang

Value-aware Importance Weighting for Off-policy Reinforcement Learning

Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However,…

Machine Learning · Computer Science 2023-06-28 Kristopher De Asis, Eric Graves, Richard S. Sutton

Low Variance Off-policy Evaluation with State-based Importance Sampling

In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data…

Machine Learning · Computer Science 2024-05-07 David M. Bossens, Philip S. Thomas

Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning…

Machine Learning · Computer Science 2023-10-13 Zhang-Wei Hong, Aviral Kumar, Sathwik Karnik, Abhishek Bhandwaldar, Akash Srivastava, Joni Pajarinen, Romain Laroche, Abhishek Gupta, Pulkit Agrawal

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a…

Machine Learning · Computer Science 2019-05-13 Josiah P. Hanna, Scott Niekum, Peter Stone

OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Offline policy evaluation (OPE) allows us to evaluate and estimate a new sequential decision-making policy's performance by leveraging historical interaction data collected from other policies. Evaluating a new policy online without a…

Machine Learning · Computer Science 2024-11-04 Allen Nie, Yash Chandak, Christina J. Yuan, Anirudhan Badrinath, Yannis Flet-Berliac, Emma Brunskill

Evaluation-Time Policy Switching for Offline Reinforcement Learning

Offline reinforcement learning (RL) looks at learning how to optimally solve tasks using a fixed dataset of interactions from the environment. Many off-policy algorithms developed for online learning struggle in the offline setting as they…

Machine Learning · Computer Science 2025-03-18 Natinael Solomon Neggatu, Jeremie Houssineau, Giovanni Montana

Offline Policy Evaluation and Optimization under Confounding

Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to…

Machine Learning · Statistics 2023-11-08 Chinmaya Kausik, Yangyi Lu, Kevin Tan, Maggie Makar, Yixin Wang, Ambuj Tewari

Behavior Constraining in Weight Space for Offline Reinforcement Learning

In offline reinforcement learning, a policy needs to be learned from a single pre-collected dataset. Typically, policies are thus regularized during training to behave similarly to the data generating policy, by adding a penalty based on a…

Machine Learning · Computer Science 2021-07-13 Phillip Swazinna, Steffen Udluft, Daniel Hein, Thomas Runkler

Offline Model-Based Optimization via Policy-Guided Gradient Search

Offline optimization is an emerging problem in many experimental engineering domains including protein, drug or aircraft design, where online experimentation to collect evaluation data is too expensive or dangerous. To avoid that, one has…

Machine Learning · Computer Science 2024-05-10 Yassine Chemingui, Aryan Deshwal, Trong Nghia Hoang, Janardhan Rao Doppa

Policy Optimization via Importance Sampling

Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse.…

Machine Learning · Computer Science 2018-11-01 Alberto Maria Metelli, Matteo Papini, Francesco Faccio, Marcello Restelli

Off-Policy Evaluation in Embedded Spaces

Off-policy evaluation methods are important in recommendation systems and search engines, where data collected under an existing logging policy is used to estimate the performance of a new proposed policy. A common approach to this problem…

Machine Learning · Computer Science 2023-01-04 Jaron J. R. Lee, David Arbour, Georgios Theocharous

Importance Sampling Placement in Off-Policy Temporal-Difference Methods

A central challenge to applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being…

Machine Learning · Computer Science 2022-06-20 Eric Graves, Sina Ghiassian

Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

Large scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This…

Machine Learning · Computer Science 2026-05-28 Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile

Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation

In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of…

Machine Learning · Computer Science 2023-10-27 Shengpu Tang, Jenna Wiens

Importance Weighted Policy Learning and Adaptation

The ability to exploit prior experience to solve novel problems rapidly is a hallmark of biological learning systems and of great practical importance for artificial ones. In the meta reinforcement learning literature much recent work has…

Machine Learning · Computer Science 2021-06-07 Alexandre Galashov, Jakub Sygnowski, Guillaume Desjardins, Jan Humplik, Leonard Hasenclever, Rae Jeong, Yee Whye Teh, Nicolas Heess

What are the Statistical Limits of Offline RL with Linear Function Approximation?

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation…

Machine Learning · Computer Science 2020-10-23 Ruosong Wang, Dean P. Foster, Sham M. Kakade

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we…

Machine Learning · Computer Science 2026-01-06 Alexander W. Goodall, Edwin Hamel-De le Court, Francesco Belardinelli

Offline A/B testing for Recommender Systems

Before A/B testing online a new version of a recommender system, it is usual to perform some offline evaluations on historical data. We focus on evaluation methods that compute an estimator of the potential uplift in revenue that could…

Machine Learning · Statistics 2018-01-23 Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, Simon Dollé