Related papers: Case-based off-policy policy evaluation using prot…

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a…

Machine Learning · Computer Science 2019-05-13 Josiah P. Hanna, Scott Niekum, Peter Stone

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased…

Machine Learning · Computer Science 2018-10-31 Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Policy Gradient with Active Importance Sampling

Importance sampling (IS) represents a fundamental technique for a large surge of off-policy reinforcement learning approaches. Policy gradient (PG) methods, in particular, significantly benefit from IS, enabling the effective reuse of…

Machine Learning · Computer Science 2024-05-10 Matteo Papini, Giorgio Manganini, Alberto Maria Metelli, Marcello Restelli

Importance Resampling for Off-policy Prediction

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While it is consistent and unbiased, it can result in high variance updates to the weights for the value function. In this work,…

Machine Learning · Computer Science 2019-11-15 Matthew Schlegel, Wesley Chung, Daniel Graves, Jian Qian, Martha White

On the Reuse Bias in Off-Policy Reinforcement Learning

Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable and previous attempts to…

Machine Learning · Computer Science 2025-05-20 Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, Jun Zhu

Relative Importance Sampling for off-Policy Actor-Critic in Deep Reinforcement Learning

Off-policy learning exhibits greater instability when compared to on-policy learning in reinforcement learning (RL). The difference in probability distribution between the target policy ($\pi$) and the behavior policy ($b$) is a major cause…

Machine Learning · Computer Science 2025-08-19 Mahammad Humayoo, Gengzhong Zheng, Xiaoqing Dong, Liming Miao, Shuwei Qiu, Zexun Zhou, Peitao Wang, Zakir Ullah, Naveed Ur Rehman Junejo, Xueqi Cheng

Importance Sampling Placement in Off-Policy Temporal-Difference Methods

A central challenge to applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being…

Machine Learning · Computer Science 2022-06-20 Eric Graves, Sina Ghiassian

Uncertainty-Aware Instance Reweighting for Off-Policy Learning

Off-policy learning, referring to the procedure of policy optimization with access only to logged feedback data, has shown importance in various real-world applications, such as search engines, recommender systems, etc. While the…

Machine Learning · Computer Science 2023-09-28 Xiaoying Zhang, Junpu Chen, Hongning Wang, Hong Xie, Yang Liu, John C. S. Lui, Hang Li

Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for…

Artificial Intelligence · Computer Science 2017-12-07 Zhaohan Daniel Guo, Philip S. Thomas, Emma Brunskill

Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling

Off-policy policy estimators that use importance sampling (IS) can suffer from high variance in long-horizon domains, and there has been particular excitement over new IS methods that leverage the structure of Markov decision processes. We…

Machine Learning · Computer Science 2020-06-09 Yao Liu, Pierre-Luc Bacon, Emma Brunskill

Value-aware Importance Weighting for Off-policy Reinforcement Learning

Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However,…

Machine Learning · Computer Science 2023-06-28 Kristopher De Asis, Eric Graves, Richard S. Sutton

Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation

This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to…

Machine Learning · Computer Science 2025-05-29 Hongyi Zhou, Josiah P. Hanna, Jin Zhu, Ying Yang, Chengchun Shi

Deeply-Debiased Off-Policy Interval Estimation

Off-policy evaluation learns a target policy's value with a historical dataset generated by a different behavior policy. In addition to a point estimate, many applications would benefit significantly from having a confidence interval (CI)…

Machine Learning · Statistics 2021-06-09 Chengchun Shi, Runzhe Wan, Victor Chernozhukov, Rui Song

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving,…

Machine Learning · Computer Science 2018-11-08 Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

Leveraging Factored Action Spaces for Off-Policy Evaluation

Off-policy evaluation (OPE) aims to estimate the benefit of following a counterfactual sequence of actions, given data collected from executed sequences. However, existing OPE estimators often exhibit high bias and high variance in problems…

Machine Learning · Computer Science 2023-07-17 Aaman Rebello, Shengpu Tang, Jenna Wiens, Sonali Parbhoo

Offline Policy Optimization with Eligible Actions

Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in…

Machine Learning · Computer Science 2022-07-05 Yao Liu, Yannis Flet-Berliac, Emma Brunskill

Policy Optimization Through Approximate Importance Sampling

Recent policy optimization approaches (Schulman et al., 2015a; 2017) have achieved substantial empirical successes by constructing new proxy optimization objectives. These proxy objectives allow stable and low variance policy learning, but…

Machine Learning · Computer Science 2020-02-24 Marcin B. Tomczak, Dongho Kim, Peter Vrancx, Kee-Eung Kim

Reliable Off-policy Evaluation for Reinforcement Learning

In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy.…

Machine Learning · Computer Science 2022-11-04 Jie Wang, Rui Gao, Hongyuan Zha

Logarithmic Accuracy in Importance Sampling via Large Deviations

Importance sampling (IS) is a widely used simulation method for estimating rare event probabilities. In IS, the relative variance of an estimator is the most common measure of estimator accuracy, and the focus of existing literature is on…

Statistics Theory · Mathematics 2026-01-05 Julie Choi, Peter Glynn

Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters

In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of empirical studies, we demonstrate how accurate OPE is strongly…

Machine Learning · Computer Science 2018-07-11 Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, Emma Brunskill