Related papers: Modified Actor-Critics

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.…

Machine Learning · Computer Science 2017-08-29 John Schulman , Filip Wolski , Prafulla Dhariwal , Alec Radford , Oleg Klimov

Truly Proximal Policy Optimization

Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from…

Machine Learning · Computer Science 2020-01-15 Yuhui Wang , Hao He , Chao Wen , Xiaoyang Tan

Simple Policy Optimization

Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust…

Machine Learning · Computer Science 2025-07-29 Zhengpeng Xie , Qiang Zhang , Fan Yang , Marco Hutter , Renjing Xu

Reflective Policy Optimization

On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy…

Machine Learning · Computer Science 2024-06-07 Yaozhong Gan , Renye Yan , Zhe Wu , Junliang Xing

Stable Policy Optimization via Off-Policy Divergence Regularization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a…

Machine Learning · Computer Science 2020-06-22 Ahmed Touati , Amy Zhang , Joelle Pineau , Pascal Vincent

Trust Region-Guided Proximal Policy Optimization

Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO…

Machine Learning · Computer Science 2019-11-11 Yuhui Wang , Hao He , Xiaoyang Tan , Yaozhong Gan

Transductive Off-policy Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies…

Machine Learning · Computer Science 2024-06-07 Yaozhong Gan , Renye Yan , Xiaoyang Tan , Zhe Wu , Junliang Xing

Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

We prove under commonly used assumptions the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function, the actor, and a value function, the critic. Both functions can be deep neural…

Machine Learning · Computer Science 2020-12-03 Markus Holzleitner , Lukas Gruber , José Arjona-Medina , Johannes Brandstetter , Sepp Hochreiter

Proximal Policy Optimization with Mixed Distributed Training

Instability and slowness are two main problems in deep reinforcement learning. Even if proximal policy optimization (PPO) is the state of the art, it still suffers from these two problems. We introduce an improved algorithm based on…

Machine Learning · Computer Science 2019-10-01 Zhenyu Zhang , Xiangfeng Luo , Tong Liu , Shaorong Xie , Jianshu Wang , Wei Wang , Yang Li , Yan Peng

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is…

Machine Learning · Computer Science 2019-12-13 Lior Shani , Yonathan Efroni , Shie Mannor

ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

Deep reinforcement learning has been able to solve various tasks successfully, however, due to the construction of policy gradient and training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most…

Machine Learning · Computer Science 2026-02-11 Hanyong Wang , Menglong Yang

Supervised Policy Update for Deep Reinforcement Learning

We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the…

Machine Learning · Computer Science 2018-12-27 Quan Vuong , Yiming Zhang , Keith W. Ross

TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback

Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with…

Machine Learning · Computer Science 2026-03-24 Lei Pang , Jun Luo , Ruinan Jin

An Adaptive Clipping Approach for Proximal Policy Optimization

Very recently proximal policy optimization (PPO) algorithms have been proposed as first-order optimization methods for effective reinforcement learning. While PPO is inspired by the same learning theory that justifies trust region policy…

Machine Learning · Computer Science 2018-04-20 Gang Chen , Yiming Peng , Mengjie Zhang

Approximate Modified Policy Iteration

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form…

Artificial Intelligence · Computer Science 2012-05-21 Bruno Scherrer , Victor Gabillon , Mohammad Ghavamzadeh , Matthieu Geist

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various…

Machine Learning · Computer Science 2023-01-16 Zaiwei Chen , Siva Theja Maguluri

On the Convergence of Approximate and Regularized Policy Iteration Schemes

Entropy regularized algorithms such as Soft Q-learning and Soft Actor-Critic, recently showed state-of-the-art performance on a number of challenging reinforcement learning (RL) tasks. The regularized formulation modifies the standard RL…

Machine Learning · Statistics 2019-10-15 Elena Smirnova , Elvis Dohmatob

Trust Region Policy Optimization

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy…

Machine Learning · Computer Science 2017-04-24 John Schulman , Sergey Levine , Philipp Moritz , Michael I. Jordan , Pieter Abbeel

Clipped-Objective Policy Gradients for Pessimistic Policy Optimization

To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences.…

Machine Learning · Computer Science 2023-11-13 Jared Markowitz , Edward W. Staley

Mirror Learning: A Unifying Framework of Policy Optimisation

Modern deep reinforcement learning (RL) algorithms are motivated by either the generalised policy iteration (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven…

Machine Learning · Computer Science 2024-11-21 Jakub Grudzien Kuba , Christian Schroeder de Witt , Jakob Foerster