Group Sequence Policy Optimization

Chujie Zheng; Shixuan Liu; Mingze Li; Xiong-Hui Chen; Bowen Yu; Chang Gao; Kai Dang; Yuqiong Liu; Rui Men; An Yang; Jingren Zhou; Junyang Lin

Subjects: Machine Learning; Artificial Intelligence; Computation and Language
Date: 2025-07-29 (v2)

Abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
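The sequence-level ratio and clipping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not spell out the formula, so the length-normalized ratio, the group-normalized advantage, the clipping range `clip_eps=0.2`, and all function names below are assumptions chosen for illustration.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Sequence-level importance ratio for one response.

    Assumption: the ratio of sequence likelihoods is length-normalized,
    i.e. exp(mean over tokens of (new log-prob - old log-prob)).
    """
    assert new_logps and len(new_logps) == len(old_logps)
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_log_ratio)


def group_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its group (assumed: z-score)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_sequence_objective(new_logps, old_logps, rewards, clip_eps=0.2):
    """Average clipped surrogate over a group of responses to one prompt.

    new_logps / old_logps: one list of per-token log-probs per response.
    The clip is applied once per sequence, not once per token.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)


# Two responses to the same prompt; the first was rewarded, the second was not.
objective = clipped_sequence_objective(
    new_logps=[[-1.0, -1.0], [-1.0, -1.0]],
    old_logps=[[-2.0, -2.0], [-1.0, -1.0]],
    rewards=[1.0, 0.0],
)
```

In this toy group the first response's ratio exceeds the upper clip bound, so its contribution is capped for the whole sequence at once. A token-level scheme would instead decide clipping separately for each token.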

Keywords

policy gradient; hyperparameter optimization; reinforcement learning

Cite

@article{arxiv.2507.18071,
  title  = {Group Sequence Policy Optimization},
  author = {Chujie Zheng and Shixuan Liu and Mingze Li and Xiong-Hui Chen and Bowen Yu and Chang Gao and Kai Dang and Yuqiong Liu and Rui Men and An Yang and Jingren Zhou and Junyang Lin},
  journal= {arXiv preprint arXiv:2507.18071},
  year   = {2025}
}
