Related papers: A Convergent Off-Policy Temporal Difference Algori…

Backstepping Temporal Difference Learning

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer form divergence…

Machine Learning · Computer Science 2025-04-21 Han-Dong Lim , Donghwan Lee

Investigating practical linear temporal difference learning

Off-policy reinforcement learning has many applications including: learning from demonstration, learning multiple goal seeking policies in parallel, and representing predictive knowledge. Recently there has been an proliferation of new…

Machine Learning · Computer Science 2016-04-01 Adam White , Martha White

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving,…

Machine Learning · Computer Science 2018-11-08 Sina Ghiassian , Andrew Patterson , Martha White , Richard S. Sutton , Adam White

On a convergent off -policy temporal difference learning algorithm in on-line learning environment

In this paper we provide a rigorous convergence analysis of a "off"-policy temporal difference learning algorithm with linear function approximation and per time-step linear computational complexity in "online" learning environment. The…

Machine Learning · Computer Science 2016-05-20 Prasenjit Karmakar , Rajkumar Maity , Shalabh Bhatnagar

O$^2$TD: (Near)-Optimal Off-Policy TD Learning

Temporal difference learning and Residual Gradient methods are the most widely used temporal difference based learning algorithms; however, it has been shown that none of their objective functions is optimal w.r.t approximating the true…

Machine Learning · Computer Science 2017-04-21 Bo Liu , Daoming Lyu , Wen Dong , Saad Biaz

Off-Policy Reinforcement Learning with Loss Function Weighted by Temporal Difference Error

Training agents via off-policy deep reinforcement learning (RL) requires a large memory, named replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches…

Machine Learning · Computer Science 2022-12-27 Bumgeun Park , Taeyoung Kim , Woohyeon Moon , Luiz Felipe Vecchietti , Dongsoo Har

Gradient Descent Temporal Difference-difference Learning

Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems…

Machine Learning · Computer Science 2022-09-13 Rong J. B. Zhu , James M. Murray

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement…

Machine Learning · Computer Science 2018-11-07 Jalaj Bhandari , Daniel Russo , Raghav Singal

Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation

This paper analyzes multi-step temporal difference (TD)-learning algorithms within the ``deadly triad'' scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step…

Machine Learning · Computer Science 2026-02-24 Han-Dong Lim , Donghwan Lee

Chaining Value Functions for Off-Policy Learning

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or…

Machine Learning · Computer Science 2022-02-03 Simon Schmitt , John Shawe-Taylor , Hado van Hasselt

On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several…

Machine Learning · Computer Science 2018-03-30 Huizhen Yu

Consistent On-Line Off-Policy Evaluation

The problem of on-line off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based…

Machine Learning · Statistics 2017-02-24 Assaf Hallak , Shie Mannor

Importance Sampling Placement in Off-Policy Temporal-Difference Methods

A central challenge to applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being…

Machine Learning · Computer Science 2022-06-20 Eric Graves , Sina Ghiassian

Stable and Efficient Policy Evaluation

Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled:…

Machine Learning · Computer Science 2021-12-30 Daoming Lyu , Bo Liu , Matthieu Geist , Wen Dong , Saad Biaz , Qi Wang

Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies

In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must…

Machine Learning · Computer Science 2018-05-03 Yann Ollivier

Q($\lambda$) with Off-Policy Corrections

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of…

Artificial Intelligence · Computer Science 2016-08-12 Anna Harutyunyan , Marc G. Bellemare , Tom Stepleton , Remi Munos

Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two…

Machine Learning · Computer Science 2019-06-20 Hugo Penedones , Carlos Riquelme , Damien Vincent , Hartmut Maennel , Timothy Mann , Andre Barreto , Sylvain Gelly , Gergely Neu

Emphatic Algorithms for Deep Reinforcement Learning

Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation…

Machine Learning · Computer Science 2021-06-23 Ray Jiang , Tom Zahavy , Zhongwen Xu , Adam White , Matteo Hessel , Charles Blundell , Hado van Hasselt

Preferential Temporal Difference Learning

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are…

Machine Learning · Computer Science 2021-08-24 Nishanth Anand , Doina Precup

An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with…

Machine Learning · Computer Science 2021-06-15 Sina Ghiassian , Richard S. Sutton