Related papers: Gradient Extrapolation-Based Policy Optimization

RePO: Replay-Enhanced Policy Optimization

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low…

Computation and Language · Computer Science · 2025-06-12 · Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science · 2026-03-19 · Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

iGRPO: Self-Feedback-Driven LLM Reasoning

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with…

Artificial Intelligence · Computer Science · 2026-02-10 · Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging…

Machine Learning · Computer Science · 2026-05-26 · Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to…

Artificial Intelligence · Computer Science · 2025-11-11 · Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-31 · Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang

CoRPO: Adding a Correctness Bias to GRPO Improves Generalization

Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned…

Artificial Intelligence · Computer Science · 2026-03-06 · Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness

GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and…

Machine Learning · Computer Science · 2026-02-04 · Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang

DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long…

Artificial Intelligence · Computer Science · 2026-03-03 · Gang Li, Yan Chen, Ming Lin, Tianbao Yang

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training…

Machine Learning · Computer Science · 2026-03-11 · Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from…

Machine Learning · Computer Science · 2025-10-02 · Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng

ScRPO: From Errors to Insights

We introduce Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to empower large language models with advanced mathematical reasoning capabilities through iterative self-reflection and…

Artificial Intelligence · Computer Science · 2026-01-06 · Lianrui Li, Dakuan Lu, Jiawei Shao, Xuelong Li

PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning

Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve…

Machine Learning · Computer Science · 2026-05-12 · Wenquan Lu, Hai Huang, Enqi Liu, Randall Balestriero

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement…

Machine Learning · Computer Science · 2026-05-21 · Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models

The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate…

Machine Learning · Computer Science · 2025-06-06 · Fei Ding, Baiqiao Wang, Zijian Zeng, Youwei Wang

TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback

Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with…

Machine Learning · Computer Science · 2026-03-24 · Lei Pang, Jun Luo, Ruinan Jin

GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem…

Computation and Language · Computer Science · 2025-09-23 · Jixiao Zhang, Chunsheng Zuo

GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science · 2025-12-12 · Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO),…

Artificial Intelligence · Computer Science · 2025-10-28 · Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong

Group Sequence Policy Optimization

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science · 2025-07-29 · Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin