Related papers: Multi-step Reinforcement Learning: A Unifying Algo…

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Recently, a new multi-step temporal learning algorithm, called $Q(\sigma)$, unifies $n$-step Tree-Backup (when $\sigma=0$) and $n$-step Sarsa (when $\sigma=1$) by introducing a sampling parameter $\sigma$. However, similar to other…

Artificial Intelligence · Computer Science 2018-02-12 Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, Gang Pan

Double Q($\sigma$) and Q($\sigma, \lambda$): Unifying Reinforcement Learning Control Algorithms

Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q($\sigma$) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the…

Artificial Intelligence · Computer Science 2017-11-07 Markus Dumke

Gradient Q$(\sigma, \lambda)$: A Unified Algorithm with Function Approximation for Reinforcement Learning

Full-sampling (e.g., Q-learning) and pure-expectation (e.g., Expected Sarsa) algorithms are efficient and frequently used techniques in reinforcement learning. Q$(\sigma,\lambda)$ is the first approach that unifies them with eligibility trace…

Machine Learning · Computer Science 2019-09-09 Long Yang, Yu Zhang, Qian Zheng, Pengfei Li, Gang Pan

Exploring TD error as a heuristic for $\sigma$ selection in Q($\sigma$, $\lambda$)

In the landscape of TD algorithms, the Q($\sigma$, $\lambda$) algorithm is an algorithm with the ability to perform a multistep backup in an online manner while also successfully unifying the concepts of sampling with using the expectation…

Machine Learning · Computer Science 2019-12-24 Abhishek Nan

Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target

Multi-step methods such as Retrace($\lambda$) and $n$-step $Q$-learning have become a crucial component of modern deep reinforcement learning agents. These methods are often evaluated as a part of bigger architectures and their evaluations…

Machine Learning · Computer Science 2019-02-11 J. Fernando Hernandez-Garcia, Richard S. Sutton

TBQ($\sigma$): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between target policy and behavior policy. One common approach is to measure the difference between two policies in a probabilistic way,…

Machine Learning · Computer Science 2019-05-20 Longxiang Shi, Shijian Li, Longbing Cao, Long Yang, Gang Pan

Per-decision Multi-step Temporal Difference Learning with Control Variates

Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. They address…

Machine Learning · Computer Science 2018-09-10 Kristopher De Asis, Richard S. Sutton

Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: value function for the current state is moved towards a…

Machine Learning · Computer Science 2017-11-07 Sahil Sharma, Girish Raguvir J, Srivatsan Ramesh, Balaraman Ravindran

Implicit Temporal Differences

In reinforcement learning, the TD($\lambda$) algorithm is a fundamental policy evaluation method with an efficient online implementation that is suitable for large-scale problems. One practical drawback of TD($\lambda$) is its sensitivity…

Machine Learning · Statistics 2014-12-23 Aviv Tamar, Panos Toulis, Shie Mannor, Edoardo M. Airoldi

Effective Multi-step Temporal-Difference Learning for Non-Linear Function Approximation

Multi-step temporal-difference (TD) learning, where the update targets contain information from multiple time steps ahead, is one of the most popular forms of TD learning for linear function approximation. The reason is that multi-step…

Artificial Intelligence · Computer Science 2016-08-19 Harm van Seijen

Segmenting Action-Value Functions Over Time-Scales in SARSA via TD($\Delta$)

In numerous episodic reinforcement learning (RL) environments, SARSA-based methodologies are employed to enhance policies aimed at maximizing returns over long horizons. Traditional SARSA algorithms face challenges in achieving an optimal…

Machine Learning · Computer Science 2025-09-05 Mahammad Humayoo

Time-Scale Separation in Q-Learning: Extending TD($\triangle$) for Action-Value Function Decomposition

Q-Learning is a fundamental off-policy reinforcement learning (RL) algorithm that has the objective of approximating action-value functions in order to learn optimal policies. Nonetheless, it has difficulties in reconciling bias with…

Machine Learning · Computer Science 2024-11-22 Mahammad Humayoo

Schedule Based Temporal Difference Algorithms

Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($\lambda$) is a popular class of algorithms to solve this problem. However, the weights assigned to different $n$-step…

Machine Learning · Computer Science 2021-11-24 Rohan Deb, Meet Gandhi, Shalabh Bhatnagar

Adaptive Tree Backup Algorithms for Temporal-Difference Reinforcement Learning

Q($\sigma$) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values for the interpolation parameter $\sigma \in…

Machine Learning · Computer Science 2022-06-07 Brett Daley, Isaac Chan

Beyond the One Step Greedy Approach in Reinforcement Learning

The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., $n$-step and trace-based returns, have been…

Artificial Intelligence · Computer Science 2018-08-01 Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

An Analysis of Action-Value Temporal-Difference Methods That Learn State Values

The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value…

Machine Learning · Computer Science 2025-09-05 Brett Daley, Prabhat Nagarajan, Martha White, Marlos C. Machado

SMIX($\lambda$): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

Learning a stable and generalizable centralized value function (CVF) is a crucial but challenging task in multi-agent reinforcement learning (MARL), as it has to deal with the issue that the joint action space increases exponentially with…

Multiagent Systems · Computer Science 2020-08-11 Xinghu Yao, Chao Wen, Yuhui Wang, Xiaoyang Tan

Implicit Q-Learning and SARSA: Liberating Policy Control from Step-Size Calibration

Q-learning and SARSA are foundational reinforcement learning algorithms whose practical success depends critically on step-size calibration. Step-sizes that are too large can cause numerical instability, while step-sizes that are too small…

Machine Learning · Statistics 2026-01-28 Hwanwoo Kim, Eric Laber

Gradient Temporal Difference with Momentum: Stability and Convergence

Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy…

Machine Learning · Computer Science 2021-11-23 Rohan Deb, Shalabh Bhatnagar

A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms

We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. We demonstrate its effectiveness by presenting simple and unified proofs of convergence for a variety of…

Machine Learning · Computer Science 2020-03-30 Philip Amortila, Doina Precup, Prakash Panangaden, Marc G. Bellemare