Related papers: Stabilizing Reinforcement Learning for Diffusion L…

Simple Policy Gradients for Reasoning with Diffusion Language Models

Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they…

Machine Learning · Computer Science 2026-02-03 Anthony Zhan

Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing…

Machine Learning · Computer Science 2025-08-19 Michael Bereket, Jure Leskovec

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning…

Machine Learning · Computer Science 2026-02-12 Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Group Relative Policy Optimization (GRPO) was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean + variance calibration of these rewards induces a weighted contrastive loss…

Machine Learning · Computer Science 2025-10-22 Youssef Mroueh

Relative Score Policy Optimization for Diffusion Language Models

Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a…

Computation and Language · Computer Science 2026-05-12 Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science 2026-05-27 Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Can GRPO Help LLMs Transcend Their Pretraining Origin?

Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its…

Machine Learning · Computer Science 2025-10-21 Kangqi Ni, Zhen Tan, Zijie Liu, Pingzhi Li, Tianlong Chen

GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science 2025-12-12 Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

iGRPO: Self-Feedback-Driven LLM Reasoning

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with…

Artificial Intelligence · Computer Science 2026-02-10 Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability…

Machine Learning · Computer Science 2026-04-20 Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta

Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients

Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and…

Machine Learning · Computer Science 2026-02-02 Cheng Ge, Caitlyn Heqi Yin, Hao Liang, Jiawei Zhang

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods…

Artificial Intelligence · Computer Science 2025-12-16 Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly…

Machine Learning · Computer Science 2026-04-03 Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Reinforcing Diffusion Models by Direct Group Preference Optimization

While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic…

Machine Learning · Computer Science 2025-10-10 Yihong Luo, Tianyang Hu, Jing Tang

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime…

Machine Learning · Computer Science 2026-05-19 Minghao Tian, Yunfei Xie, Chen Wei

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping…

Machine Learning · Computer Science 2026-01-08 Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao

DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization

The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for…

Artificial Intelligence · Computer Science 2026-01-01 Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin

Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This…

Machine Learning · Computer Science 2025-08-11 Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL)…

Computation and Language · Computer Science 2026-01-09 Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training…

Machine Learning · Computer Science 2026-05-14 Tue Le, Linh Ngo Van, Trung Le