Related papers: Soft Adaptive Policy Optimization

200 papers

A significant portion of recent research on Large Language Model (LLM) alignment focuses on developing new policy optimization methods based on Group Relative Policy Optimization (GRPO). Two prominent directions have emerged: (i) a shift…

Machine Learning · Computer Science 2026-02-27 Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Group Relative Policy Optimization (GRPO) has significantly advanced the training of large language models and enhanced their reasoning capabilities, but it remains susceptible to instability due to the use of hard clipping. Soft Adaptive…

Machine Learning · Computer Science 2026-03-26 Egor Denisov, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find some deficiencies that affect RL stability and inference efficiency, like zero-variance in…

Computation and Language · Computer Science 2025-09-30 Chen Li, Nazhou Liu, Kai Yang

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues:…

Computation and Language · Computer Science 2026-04-13 Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li, Ning Cheng, Shaojun Wang, Jing Xiao

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios,…

Machine Learning · Computer Science 2025-07-29 Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin

Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the natural…

Artificial Intelligence · Computer Science 2026-05-08 Lei Gao, Zhuoming Li, Mengxi Jia, Jiakang Yuan, Hongbo Sun, Hao Sun, Xuelong Li

Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving…

Machine Learning · Computer Science 2026-02-17 Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves…

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Effectively enhancing the reasoning capabilities of large language models using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level…

Machine Learning · Computer Science 2025-10-22 Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an…

Machine Learning · Computer Science 2026-05-11 Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information, introduces…

Machine Learning · Computer Science 2026-02-02 Madeleine Dwyer, Adam Sobey, Adriane Chapman

Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on…

Artificial Intelligence · Computer Science 2026-01-13 Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science 2025-12-12 Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

Post-training with reinforcement learning (RL) has recently shown strong promise for advancing multimodal agents beyond supervised imitation. However, RL remains limited by poor data efficiency, particularly in settings where interaction…

Machine Learning · Computer Science 2026-03-05 Chengxuan Lu, Zhenquan Zhang, Shukuan Wang, Qunzhi Lin, Baigui Sun, Yang Liu

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or…

Machine Learning · Computer Science 2025-03-10 Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping…

Machine Learning · Computer Science 2026-01-08 Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao

Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods, such as GRPO, are popular due to their critic-free and normalized advantage estimation. However, in…

Machine Learning · Computer Science 2026-01-09 Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains…

Machine Learning · Computer Science 2026-03-03 Luckeciano C. Melo, Alessandro Abate, Yarin Gal

Aligning large-scale vision-language models (VLMs) for complex reasoning via reinforcement learning is often hampered by the limitations of existing policy optimization algorithms, such as static training schedules and the rigid, uniform…

Artificial Intelligence · Computer Science 2025-10-02 Yunhao Wang, Ziting Li, Shuai Chen, Tao Liu, Chao Song, Junjie Jiang, Jian Zhu, Peng Gao, Bin Qin