Related papers: Ratio-Variance Regularized Policy Optimization

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping…

Machine Learning · Computer Science 2026-01-08 Yu Luo , Shuo Han , Yihan Hu , Dong Li , Jianye Hao

Simple Policy Optimization

Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust…

Machine Learning · Computer Science 2025-07-29 Zhengpeng Xie , Qiang Zhang , Fan Yang , Marco Hutter , Renjing Xu

Stable Policy Optimization via Off-Policy Divergence Regularization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL). While these methods achieve state-of-the-art performance across a…

Machine Learning · Computer Science 2020-06-22 Ahmed Touati , Amy Zhang , Joelle Pineau , Pascal Vincent

Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality

The goal of robust constrained reinforcement learning (RL) is to optimize an agent's performance under the worst-case model uncertainty while satisfying safety or resource constraints. In this paper, we demonstrate that strong duality does…

Machine Learning · Computer Science 2025-09-23 Shaocong Ma , Ziyi Chen , Yi Zhou , Heng Huang

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science 2026-05-27 Penghui Qi , Xiangxin Zhou , Zichen Liu , Tianyu Pang , Chao Du , Min Lin , Wee Sun Lee

Truly Proximal Policy Optimization

Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from…

Machine Learning · Computer Science 2020-01-15 Yuhui Wang , Hao He , Chao Wen , Xiaoyang Tan

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is…

Machine Learning · Computer Science 2019-12-13 Lior Shani , Yonathan Efroni , Shie Mannor

RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model…

Sound · Computer Science 2026-02-17 Cong Wang , Changfeng Gao , Yang Xiang , Zhihao Du , Keyu An , Han Zhao , Qian Chen , Xiangang Li , Yingming Gao , Ya Li

Clipped-Objective Policy Gradients for Pessimistic Policy Optimization

To facilitate efficient learning, policy gradient approaches to deep reinforcement learning (RL) are typically paired with variance reduction measures and strategies for making large but safe policy changes based on a batch of experiences.…

Machine Learning · Computer Science 2023-11-13 Jared Markowitz , Edward W. Staley

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), as the widely employed policy based reinforcement learning (RL) methods, are prone to converge to a sub-optimal solution as they limit the policy representation…

Machine Learning · Computer Science 2020-06-16 Jun Song , Chaoyue Zhao

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an…

Machine Learning · Computer Science 2026-05-11 Guobin Shen , Chenxiao Zhao , Xiang Cheng , Lei Huang , Xing Yu

R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between…

Machine Learning · Computer Science 2026-01-26 Jingchu Wang , Bingbing Xu , Yige Yuan , Bin Xie , Xiaoqian Sun , Huawei Shen

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck:…

Machine Learning · Computer Science 2026-03-06 Yuan Li , Bo Wang , Yufei Gao , Yuqian Yao , Xinyuan Wang , Zhangyue Yin , Xipeng Qiu

Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Group Relative Policy Optimization (GRPO) was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean + variance calibration of these rewards induces a weighted contrastive loss…

Machine Learning · Computer Science 2025-10-22 Youssef Mroueh

Policy Optimization for $\mathcal{H}_2$ Linear Control with $\mathcal{H}_\infty$ Robustness Guarantee: Implicit Regularization and Global Convergence

Policy optimization (PO) is a key ingredient for reinforcement learning (RL). For control design, certain constraints are usually enforced on the policies to optimize, accounting for either the stability, robustness, or safety concerns on…

Optimization and Control · Mathematics 2021-02-16 Kaiqing Zhang , Bin Hu , Tamer Başar

Value-Free Policy Optimization via Reward Partitioning

Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it…

Machine Learning · Computer Science 2025-12-23 Bilal Faye , Hanane Azzag , Mustapha Lebbah

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Jing Wang , Jiajun Liang , Jie Liu , Henglin Liu , Gongye Liu , Jun Zheng , Wanyuan Pang , Ao Ma , Zhenyu Xie , Xintao Wang , Meng Wang , Pengfei Wan , Xiaodan Liang

Bounded Ratio Reinforcement Learning

Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying…

Machine Learning · Computer Science 2026-05-01 Yunke Ao , Le Chen , Bruce D. Lee , Assefa S. Wahd , Aline Czarnobai , Philipp Fürnstahl , Bernhard Schölkopf , Andreas Krause

Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show…

Machine Learning · Computer Science 2025-05-27 Tao Wang , Ruipeng Zhang , Sicun Gao

Stabilizing Reinforcement Learning for Diffusion Language Models

Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two…

Machine Learning · Computer Science 2026-03-10 Jianyuan Zhong , Kaibo Wang , Ding Ding , Zijin Feng , Haoli Bai , Yang Xiang , Jiacheng Sun , Qiang Xu