Related papers: Sample Complexity Bounds for Two Timescale Value-b…
Greedy-GQ is an off-policy two timescale algorithm for optimal control in reinforcement learning. This paper develops the first finite-sample analysis for the Greedy-GQ algorithm with linear function approximation under Markovian noise. Our…
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to…
Temporal-difference learning with gradient correction (TDC) is a two time-scale algorithm for policy evaluation in reinforcement learning. This algorithm was initially proposed with linear function approximation, and was later extended to…
Greedy-GQ with linear function approximation, originally proposed in \cite{maei2010toward}, is a value-based off-policy algorithm for optimal control in reinforcement learning, and it has a non-linear two timescale structure with the…
Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help to improve the sample complexity in policy evaluation. However, the existing work applied variance reduction to either the less…
Greedy-GQ is a value-based reinforcement learning (RL) algorithm for optimal control. Recently, the finite-time analysis of Greedy-GQ has been developed under linear function approximation and Markovian sampling, and the algorithm is shown…
We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform the regular one in the recent study \citep{bellemare2017distributional}. In the…
Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample…
Stochastic approximation (SA) is an iterative algorithm for finding the fixed point of an operator using noisy samples and widely used in optimization and Reinforcement Learning (RL). The noise in RL exhibits a Markovian structure, and in…
Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate…
As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of…
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms…
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by `controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise…
In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD)…
Two-time-scale Stochastic Approximation (SA) is an iterative algorithm with applications in reinforcement learning and optimization. Prior finite time analysis of such algorithms has focused on fixed point iterations with mappings…
Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic…
Linear two-timescale stochastic approximation (SA) scheme is an important class of algorithms which has become popular in reinforcement learning (RL), particularly for the policy evaluation problem. Recently, a number of works have been…
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several…
Motivated by broad applications in machine learning, we study the popular accelerated stochastic gradient descent (ASGD) algorithm for solving (possibly nonconvex) optimization problems. We characterize the finite-time performance of this…
Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show…