Related papers: Optimistic Policy Regularization

Absolute Policy Optimization

In recent years, trust region on-policy reinforcement learning has achieved impressive results in addressing complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms within this category primarily…

Machine Learning · Computer Science 2024-05-31 Weiye Zhao, Feihan Li, Yifan Sun, Rui Chen, Tianhao Wei, Changliu Liu

Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

In offline reinforcement learning, the challenge of out-of-distribution (OOD) is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from the…

Machine Learning · Computer Science 2024-07-16 Tenglong Liu, Yang Li, Yixing Lan, Hao Gao, Wei Pan, Xin Xu

Robust Policy Optimization in Deep Reinforcement Learning

The policy gradient method enjoys the simplicity of the objective where the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, parameterized distribution of action distribution allows easy control of…

Machine Learning · Computer Science 2022-12-16 Md Masudur Rahman, Yexiang Xue

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), as the widely employed policy based reinforcement learning (RL) methods, are prone to converge to a sub-optimal solution as they limit the policy representation…

Machine Learning · Computer Science 2020-06-16 Jun Song, Chaoyue Zhao

Optimistic Proximal Policy Optimization

Reinforcement Learning, a machine learning framework for training an autonomous agent based on rewards, has shown outstanding results in various domains. However, it is known that learning a good policy is difficult in a domain where…

Machine Learning · Computer Science 2019-06-27 Takahisa Imagawa, Takuya Hiraoka, Yoshimasa Tsuruoka

On-Policy RL with Optimal Reward Baseline

Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability…

Machine Learning · Computer Science 2025-06-05 Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei

Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization

Reinforcement Learning (RL) has achieved remarkable success in sequential decision tasks. However, recent studies have revealed the vulnerability of RL policies to different perturbations, raising concerns about their effectiveness and…

Machine Learning · Computer Science 2025-07-08 Buqing Nie, Yangqing Fu, Jingtian Ji, Yue Gao

Revisiting Design Choices in Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian…

Machine Learning · Computer Science 2020-09-24 Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, Moritz Hardt

PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward…

Artificial Intelligence · Computer Science 2026-04-06 Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai

PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay

On-policy deep reinforcement learning algorithms have low data utilization and require significant experience for policy improvement. This paper proposes a proximal policy optimization algorithm with prioritized trajectory replay (PTR-PPO)…

Machine Learning · Computer Science 2021-12-09 Xingxing Liang, Yang Ma, Yanghe Feng, Zhong Liu

Adversarial Policy Optimization in Deep Reinforcement Learning

The policy represented by the deep neural network can overfit the spurious features in observations, which hamper a reinforcement learning agent from learning effective policy. This issue becomes severe in high-dimensional state, where the…

Machine Learning · Computer Science 2023-05-01 Md Masudur Rahman, Yexiang Xue

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either…

Machine Learning · Computer Science 2023-12-05 Qinghua Liu, Gellért Weisz, András György, Chi Jin, Csaba Szepesvári

On the Convergence of Approximate and Regularized Policy Iteration Schemes

Entropy regularized algorithms such as Soft Q-learning and Soft Actor-Critic, recently showed state-of-the-art performance on a number of challenging reinforcement learning (RL) tasks. The regularized formulation modifies the standard RL…

Machine Learning · Statistics 2019-10-15 Elena Smirnova, Elvis Dohmatob

Iteratively Refined Behavior Regularization for Offline Reinforcement Learning

One of the fundamental challenges for offline reinforcement learning (RL) is ensuring robustness to data distribution. Whether the data originates from a near-optimal policy or not, we anticipate that an algorithm should demonstrate its…

Machine Learning · Computer Science 2023-10-18 Xiaohan Hu, Yi Ma, Chenjun Xiao, Yan Zheng, Jianye Hao

Efficient Deep Reinforcement Learning with Predictive Processing Proximal Policy Optimization

Advances in reinforcement learning (RL) often rely on massive compute resources and remain notoriously sample inefficient. In contrast, the human brain is able to efficiently learn effective control strategies using limited resources. This…

Machine Learning · Computer Science 2024-01-30 Burcu Küçükoğlu, Walraaf Borkent, Bodo Rueckauer, Nasir Ahmad, Umut Güçlü, Marcel van Gerven

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.…

Machine Learning · Computer Science 2017-08-29 John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Despite Proximal Policy Optimization (PPO) dominating policy gradient methods -- from robotic control to game AI -- its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates…

Machine Learning · Computer Science 2025-05-26 Ben Rahman

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical…

Machine Learning · Computer Science 2026-02-11 Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

Evolutionary Policy Optimization

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data…

Machine Learning · Computer Science 2025-11-13 Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak

Experience Replay Optimization

Experience replay enables reinforcement learning agents to memorize and reuse past experiences, just as humans replay memories for the situation at hand. Contemporary off-policy algorithms either replay past experiences uniformly or utilize…

Machine Learning · Computer Science 2019-06-21 Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, Xia Hu