Related papers: Soft Adaptive Policy Optimization

Soft Sequence Policy Optimization

A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift…

Machine Learning · Computer Science · 2026-02-27 · Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Smooth Gate Functions for Soft Advantage Policy Optimization

Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, while it remains susceptible to instability due to the use of hard clipping. Soft Adaptive…

Machine Learning · Computer Science · 2026-03-26 · Egor Denisov, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Since DeepSeek-R1 popularized, Group Relative Policy Optimization (GRPO) has become the core part of training Reasoning LLMs. However, we find some deficiency that influences RL stability and inference efficiency, like zero-variance in…

Computation and Language · Computer Science · 2025-09-30 · Chen Li, Nazhou Liu, Kai Yang

SSPO: Subsentence-level Policy Optimization

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues:…

Computation and Language · Computer Science · 2026-04-13 · Kun Yang, Zikang chen, Yanmeng Wang, Zhigen Li, Ning Cheng, Shaojun Wang, Jing Xiao

Group Sequence Policy Optimization

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science · 2025-07-29 · Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural…

Artificial Intelligence · Computer Science · 2026-05-08 · Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li

ESPO: Entropy Importance Sampling Policy Optimization

Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving…

Machine Learning · Computer Science · 2026-02-17 · Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves…

Machine Learning · Computer Science · 2025-10-23 · Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science · 2026-03-19 · Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level…

Machine Learning · Computer Science · 2025-10-22 · Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an…

Machine Learning · Computer Science · 2026-05-11 · Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

It's Not You, It's Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL

Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information, introduces…

Machine Learning · Computer Science · 2026-02-02 · Madeleine Dwyer, Adam Sobey, Adriane Chapman

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on…

Artificial Intelligence · Computer Science · 2026-01-13 · Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen

GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science · 2025-12-12 · Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

GIPO: Gaussian Importance Sampling Policy Optimization

Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction…

Machine Learning · Computer Science · 2026-03-05 · Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Baigui Sun, Yang Liu

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or…

Machine Learning · Computer Science · 2025-03-10 · Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping…

Machine Learning · Computer Science · 2026-01-08 · Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao

GAPO: Robust Advantage Estimation for Real-World Code LLMs

Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods, such as GRPO, are popular due to their critic-free and normalized advantage estimation. However, in…

Machine Learning · Computer Science · 2026-01-09 · Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains…

Machine Learning · Computer Science · 2026-03-03 · Luckeciano C. Melo, Alessandro Abate, Yarin Gal

ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning

Aligning large-scale vision-language models (VLMs) for complex reasoning via reinforcement learning is often hampered by the limitations of existing policy optimization algorithms, such as static training schedules and the rigid, uniform…

Artificial Intelligence · Computer Science · 2025-10-02 · Yunhao Wang, Ziting Li, Shuai Chen, Tao Liu, Chao Song, Junjie Jiang, Jian Zhu, Peng Gao, Bin Qin