Related papers: Deep Conservative Policy Iteration
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy…
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form…
We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy…
Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem at hand. Most current methods are geared towards exploiting the…
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained…
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [2], AlphaGo-Zero from [27]). This new family of algorithms maintains, and alternately optimizes,…
This paper proposes a general incremental policy iteration adaptive dynamic programming (ADP) algorithm for model-free robust optimal control of unknown nonlinear systems. The approach integrates recursive least squares estimation with…
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, that compute an approximately optimal policy by following the standard Policy Iteration (PI)…
Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the foundation for…
Recent successful deep reinforcement learning algorithms, such as Trust Region Policy Optimization (TRPO) or Proximal Policy Optimization (PPO), are fundamentally variations of conservative policy iteration (CPI). These algorithms iterate…
Deep reinforcement learning (DRL) policies are vulnerable to unauthorized replication attacks, where an adversary exploits imitation learning to reproduce target policies from observed behavior. In this paper, we propose Constrained…
In a recent work, we proposed Reliable Policy Iteration (RPI), that restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on…
Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited,…
In optimal control problems, policy iteration (PI) is a powerful reinforcement learning (RL) tool used to design optimal controllers for linear systems. However, the need for an initial stabilizing control policy significantly limits…
This paper introduces Deep Policy Iteration (DPI), a novel approach that integrates the strengths of Neural Networks with the stability and convergence advantages of Policy Iteration (PI) to address high-dimensional stochastic Mean Field…
Policy Iteration (PI) is a classical family of algorithms to compute an optimal policy for any given Markov Decision Problem (MDP). The basic idea in PI is to begin with some initial policy and to repeatedly update the policy to one from an…
Given a discounted cost, we study deterministic discrete-time systems whose inputs are generated by policy iteration (PI). We provide novel near-optimality and stability properties, while allowing for non-stabilizing initial policies. That…
This paper presents a constrained adaptive dynamic programming (CADP) algorithm to solve general nonlinear nonaffine optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly deal with problems with state…
Safety is the priority concern when applying reinforcement learning (RL) algorithms to real-world control problems. While policy iteration provides a fundamental algorithm for standard RL, an analogous theoretical algorithm for safe RL…
Entropy-regularized algorithms such as Soft Q-learning and Soft Actor-Critic recently showed state-of-the-art performance on a number of challenging reinforcement learning (RL) tasks. The regularized formulation modifies the standard RL…
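Most of the papers listed above build on policy iteration (PI), which alternates policy evaluation with greedy policy improvement. For reference only, here is a minimal sketch of exact tabular PI, assuming a small finite MDP stored as NumPy arrays. The function name `policy_iteration` and the array layout (`P[a, s, s2]`, `R[s, a]`) are hypothetical illustrative choices, not taken from any listed paper. It does not implement the approximate, conservative, or reliable variants those papers study.

```python
import numpy as np


def policy_iteration(P, R, gamma=0.9, max_iters=1000):
    """Exact tabular policy iteration (illustrative sketch, hypothetical API).

    P: array of shape (A, S, S), where P[a, s, s2] = Pr(s2 | s, a).
    R: array of shape (S, A), the expected reward for taking action a in state s.
    Returns a deterministic policy (one action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    v = np.zeros(n_states)
    for _ in range(max_iters):
        # Policy evaluation: solve the linear system (I - gamma * P_pi) v = r_pi.
        P_pi = P[policy, states, :]  # row s is P[policy[s], s, :], shape (S, S)
        r_pi = R[states, policy]     # shape (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # Policy improvement: q(s, a) = r(s, a) + gamma * sum_s2 P(s2 | s, a) v(s2).
        q = R + gamma * (P @ v).T    # (P @ v) has shape (A, S); transpose to (S, A)
        greedy = np.argmax(q, axis=1)

        # Switch action only on a strict improvement, so ties cannot cause cycling.
        better = q[states, greedy] > q[states, policy] + 1e-12
        new_policy = np.where(better, greedy, policy)
        if np.array_equal(new_policy, policy):
            break  # policy is stable, hence optimal for this MDP
        policy = new_policy
    return policy, v


# Two-state example: in state 1, action 0 stays put and earns reward 1;
# in state 0, action 1 moves to state 1. All other transitions earn 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch state
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
pi, v = policy_iteration(P, R, gamma=0.9)
print(pi.tolist())                # [1, 0]
print(np.round(v, 6).tolist())    # [9.0, 10.0]
```

Evaluation here uses an exact linear solve, which only scales to small state spaces. Most of the listed work studies what happens when this step, the improvement step, or both are approximated.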