Related papers: Soft Sequence Policy Optimization

Soft Adaptive Policy Optimization

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often…

Machine Learning · Computer Science 2025-12-02 Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an…

Machine Learning · Computer Science 2026-05-11 Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Group Sequence Policy Optimization

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science 2025-07-29 Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling…

Machine Learning · Computer Science 2025-06-02 Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, Jesus Rios

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit…

Artificial Intelligence · Computer Science 2026-04-13 Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization

Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an…

Cryptography and Security · Computer Science 2025-07-08 Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or…

Machine Learning · Computer Science 2025-03-10 Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training…

Machine Learning · Computer Science 2026-05-14 Tue Le, Linh Ngo Van, Trung Le

SSPO: Subsentence-level Policy Optimization

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues:…

Computation and Language · Computer Science 2026-04-13 Kun Yang, Zikang chen, Yanmeng Wang, Zhigen Li, Ning Cheng, Shaojun Wang, Jing Xiao

RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training

Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from…

Machine Learning · Computer Science 2025-10-02 Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng

GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning

The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. It raises a question whether GRPO can…

Machine Learning · Computer Science 2025-11-20 Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li

Smooth Gate Functions for Soft Advantage Policy Optimization

Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, while it remains susceptible to instability due to the use of hard clipping. Soft Adaptive…

Machine Learning · Computer Science 2026-03-26 Egor Denisov, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability…

Machine Learning · Computer Science 2026-04-20 Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta

Token-Efficient RL for LLM Reasoning

We propose reinforcement learning (RL) strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits, with a particular focus on compatibility with LoRA fine-tuning. Building on early policy…

Machine Learning · Computer Science 2025-06-13 Alan Lee, Harry Tong

Group Causal Policy Optimization for Post-Training Large Language Models

Recent advances in large language models (LLMs) have broadened their applicability across diverse tasks, yet specialized domains still require targeted post training. Among existing methods, Group Relative Policy Optimization (GRPO) stands…

Machine Learning · Computer Science 2025-08-08 Ziyin Gu, Jingyao Wang, Ran Zuo, Chuxiong Sun, Zeen Song, Changwen Zheng, Wenwen Qiang

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime…

Machine Learning · Computer Science 2026-05-19 Minghao Tian, Yunfei Xie, Chen Wei

Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This…

Machine Learning · Computer Science 2025-08-11 Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit assignment paradigm, where all tokens within the same…

Computation and Language · Computer Science 2026-02-06 Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, Haihua Yang

ESPO: Entropy Importance Sampling Policy Optimization

Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving…

Machine Learning · Computer Science 2026-02-17 Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang