Related papers: Single-stream Policy Optimization

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly…

Machine Learning · Computer Science 2026-04-03 Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Soft Adaptive Policy Optimization

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often…

Machine Learning · Computer Science 2025-12-02 Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit…

Artificial Intelligence · Computer Science 2026-04-13 Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajectory…

Computation and Language · Computer Science 2026-05-29 Redacted by arXiv

Group Sequence Policy Optimization

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science 2025-07-29 Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

SPO: Sequential Monte Carlo Policy Optimisation

Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However,…

Artificial Intelligence · Computer Science 2024-11-01 Matthew V Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level…

Machine Learning · Computer Science 2025-10-22 Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or…

Machine Learning · Computer Science 2025-03-10 Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

Holder Policy Optimisation

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level…

Machine Learning · Computer Science 2026-05-22 Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

Soft Sequence Policy Optimization

A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift…

Machine Learning · Computer Science 2026-02-27 Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Simple Policy Optimization

Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust…

Machine Learning · Computer Science 2025-07-29 Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, Renjing Xu

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff"…

Computation and Language · Computer Science 2026-03-03 Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO,…

Machine Learning · Computer Science 2026-05-28 Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

COPO: Consistency-Aware Policy Optimization

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging…

Machine Learning · Computer Science 2025-08-07 Jinghang Han, Jiawei Chen, Hang Shao, Hao Ma, Mingcheng Li, Xintian Shen, Lihao Zheng, Wei Chen, Tao Wei, Lihua Zhang

Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This…

Machine Learning · Computer Science 2025-08-11 Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Soft Preference Optimization: Aligning Language Models to Expert Distributions

We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a…

Machine Learning · Computer Science 2024-10-07 Arsalan Sharifnassab, Saber Salehkaleybar, Sina Ghiassian, Surya Kanoria, Dale Schuurmans

Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find some deficiencies that affect RL stability and inference efficiency, such as zero-variance in…

Computation and Language · Computer Science 2025-09-30 Chen Li, Nazhou Liu, Kai Yang

RePO: Replay-Enhanced Policy Optimization

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low…

Computation and Language · Computer Science 2025-06-12 Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu