Related papers: Gradient Extrapolation-Based Policy Optimization

200 papers

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low…

Computation and Language · Computer Science 2025-06-12 Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with…

Artificial Intelligence · Computer Science 2026-02-10 Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging…

Machine Learning · Computer Science 2026-05-26 Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to…

Artificial Intelligence · Computer Science 2025-11-11 Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang

Group-Relative Policy Optimization (GRPO) has emerged as the standard for training reasoning capabilities in large language models through reinforcement learning. By estimating advantages using group-mean rewards rather than a learned…

Artificial Intelligence · Computer Science 2026-03-06 Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness

Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and…

Machine Learning · Computer Science 2026-02-04 Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang

Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long…

Artificial Intelligence · Computer Science 2026-03-03 Gang Li, Yan Chen, Ming Lin, Tianbao Yang

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training…

Machine Learning · Computer Science 2026-03-11 Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin

Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from…

Machine Learning · Computer Science 2025-10-02 Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng

We introduce Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to empower large language models with advanced mathematical reasoning capabilities through iterative self-reflection and…

Artificial Intelligence · Computer Science 2026-01-06 Lianrui Li, Dakuan Lu, Jiawei Shao, Xuelong Li

Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have shown strong potential for improving the mathematical reasoning capabilities of large language models. While a growing body of work seeks to improve…

Machine Learning · Computer Science 2026-05-12 Wenquan Lu, Hai Huang, Enqi Liu, Randall Balestriero

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement…

Machine Learning · Computer Science 2026-05-21 Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate…

Machine Learning · Computer Science 2025-06-06 Fei Ding, Baiqiao Wang, Zijian Zeng, Youwei Wang

Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with…

Machine Learning · Computer Science 2026-03-24 Lei Pang, Jun Luo, Ruinan Jin

Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem…

Computation and Language · Computer Science 2025-09-23 Jixiao Zhang, Chunsheng Zuo

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science 2025-12-12 Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO),…

Artificial Intelligence · Computer Science 2025-10-28 Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science 2025-07-29 Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin