Related papers: Guided Policy Optimization under Partial Observabi…

Reinforcement Learning using Guided Observability

Due to recent breakthroughs, reinforcement learning (RL) has demonstrated impressive performance in challenging sequential decision-making problems. However, an open question is how to make RL cope with partial observability which is…

Machine Learning · Computer Science · 2021-04-23 · Stephan Weigand, Pascal Klink, Jan Peters, Joni Pajarinen

Safe Driving via Expert Guided Policy Optimization

When learning common skills like driving, beginners usually have domain experts standing by to ensure the safety of the learning process. We formulate such learning scheme under the Expert-in-the-loop Reinforcement Learning where a guardian…

Artificial Intelligence · Computer Science · 2021-11-02 · Zhenghao Peng, Quanyi Li, Chunxiao Liu, Bolei Zhou

Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion

Deep reinforcement learning (RL) uses model-free techniques to optimize task-specific control policies. Despite having emerged as a promising approach for complex problems, RL is still hard to use reliably for real-world applications. Apart…

Robotics · Computer Science · 2020-02-25 · Siddhant Gangapurwala, Alexander Mitchell, Ioannis Havoutis

Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling…

Machine Learning · Computer Science · 2025-06-02 · Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, Jesus Rios

GPO: Growing Policy Optimization for Legged Robot Locomotion and Whole-Body Control

Training reinforcement learning (RL) policies for legged robots remains challenging due to high-dimensional continuous actions, hardware constraints, and limited exploration. Existing methods for locomotion and whole-body control work well…

Robotics · Computer Science · 2026-01-29 · Shuhao Liao, Peizhuo Li, Xinrong Yang, Linnan Chang, Zhaoxin Fan, Qing Wang, Lei Shi, Yuhong Cao, Wenjun Wu, Guillaume Sartoretti

GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However,…

Machine Learning · Computer Science · 2025-07-17 · Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu

Extending Group Relative Policy Optimization to Continuous Control: A Theoretical Framework for Robotic Reinforcement Learning

Group Relative Policy Optimization (GRPO) has shown promise in discrete action spaces by eliminating value function dependencies through group-based advantage estimation. However, its application to continuous control remains unexplored,…

Robotics · Computer Science · 2025-07-29 · Rajat Khanda, Mohammad Baqar, Sambuddha Chakrabarti, Satyasaran Changdar

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL)…

Computation and Language · Computer Science · 2026-01-09 · Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training…

Machine Learning · Computer Science · 2026-03-11 · Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

Proximal Policy Optimization with Mixed Distributed Training

Instability and slowness are two main problems in deep reinforcement learning. Even if proximal policy optimization (PPO) is the state of the art, it still suffers from these two problems. We introduce an improved algorithm based on…

Machine Learning · Computer Science · 2019-10-01 · Zhenyu Zhang, Xiangfeng Luo, Tong Liu, Shaorong Xie, Jianshu Wang, Wei Wang, Yang Li, Yan Peng

Reflective Policy Optimization

On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy…

Machine Learning · Computer Science · 2024-06-07 · Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing

Proximal Policy Optimization for Tracking Control Exploiting Future Reference Information

In recent years, reinforcement learning (RL) has gained increasing attention in control engineering. Especially, policy gradient methods are widely used. In this work, we improve the tracking performance of proximal policy optimization…

Machine Learning · Computer Science · 2021-07-21 · Jana Mayer, Johannes Westermann, Juan Pedro Gutiérrez H. Muriedas, Uwe Mettin, Alexander Lampe

G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance

Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest…

Artificial Intelligence · Computer Science · 2025-08-19 · Yongxin Guo, Wenbo Deng, Zhenglin Cheng, Xiaoying Tang

Discovered Policy Optimisation

Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations,…

Machine Learning · Computer Science · 2022-10-14 · Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

Constrained Policy Optimization

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact…

Machine Learning · Computer Science · 2017-05-31 · Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel

GOPO: Policy Optimization using Ranked Rewards

Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing…

Machine Learning · Computer Science · 2026-02-05 · Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi

Combining Benefits from Trajectory Optimization and Deep Reinforcement Learning

Recent breakthroughs both in reinforcement learning and trajectory optimization have made significant advances towards real world robotic system deployment. Reinforcement learning (RL) can be applied to many problems without needing any…

Robotics · Computer Science · 2019-10-23 · Guillaume Bellegarda, Katie Byl

Hybrid Group Relative Policy Optimization: A Multi-Sample Approach to Enhancing Policy Optimization

Hybrid Group Relative Policy Optimization (Hybrid GRPO) is a reinforcement learning framework that extends Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) by incorporating empirical multi-sample action…

Machine Learning · Computer Science · 2025-02-05 · Soham Sane

Constrained Reinforcement Learning Under Model Mismatch

Existing studies on constrained reinforcement learning (RL) may obtain a well-performing policy in the training environment. However, when deployed in a real environment, it may easily violate constraints that were originally satisfied…

Machine Learning · Computer Science · 2024-05-06 · Zhongchang Sun, Sihong He, Fei Miao, Shaofeng Zou

GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can…

Machine Learning · Computer Science · 2025-11-20 · Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li