Related papers: Stabilizing Reinforcement Learning for Diffusion L…

200 papers

Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they…

Machine Learning · Computer Science 2026-02-03 Anthony Zhan

Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing…

Machine Learning · Computer Science 2025-08-19 Michael Bereket, Jure Leskovec

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning…

Machine Learning · Computer Science 2026-02-12 Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

Group Relative Policy Optimization (GRPO) was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean + variance calibration of these rewards induces a weighted contrastive loss…

Machine Learning · Computer Science 2025-10-22 Youssef Mroueh

Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a…

Computation and Language · Computer Science 2026-05-12 Zichao Yu, Shengze Xu, Bingqing Jiang, Wenyi Zhang, Difan Zou

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science 2026-05-27 Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its…

Machine Learning · Computer Science 2025-10-21 Kangqi Ni, Zhen Tan, Zijie Liu, Pingzhi Li, Tianlong Chen

Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and…

Machine Learning · Computer Science 2025-12-12 Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with…

Artificial Intelligence · Computer Science 2026-02-10 Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

Large Language Models (LLMs) are increasingly deployed in business-critical domains such as finance, education, healthcare, and customer support, where users expect consistent and reliable recommendations. Yet LLMs often exhibit variability…

Machine Learning · Computer Science 2026-04-20 Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta

Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and…

Machine Learning · Computer Science 2026-02-02 Cheng Ge, Caitlyn Heqi Yin, Hao Liang, Jiawei Zhang

Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, we find that existing methods…

Artificial Intelligence · Computer Science 2025-12-16 Bizhe Bai, Hongming Wu, Peng Ye, Tao Chen

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly…

Machine Learning · Computer Science 2026-04-03 Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic…

Machine Learning · Computer Science 2025-10-10 Yihong Luo, Tianyang Hu, Jing Tang

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime…

Machine Learning · Computer Science 2026-05-19 Minghao Tian, Yunfei Xie, Chen Wei

On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping…

Machine Learning · Computer Science 2026-01-08 Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao

The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. While Group Relative Policy Optimization (GRPO) has emerged as a pivotal mechanism for…

Artificial Intelligence · Computer Science 2026-01-01 Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin

Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. This…

Machine Learning · Computer Science 2025-08-11 Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL)…

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training…

Machine Learning · Computer Science 2026-05-14 Tue Le, Linh Ngo Van, Trung Le