Related papers: Exploration-Driven Optimization for Test-Time Larg…

Poly-EPO: Training Exploratory Reasoning Models

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for…

Artificial Intelligence · Computer Science 2026-05-06 Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning,…

Artificial Intelligence · Computer Science 2026-05-29 Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management

Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) to unleash long Chain-of-Thought reasoning. Most contemporary reasoning approaches…

Machine Learning · Computer Science 2025-05-20 Xuerui Su, Liya Guo, Yue Wang, Yi Zhu, Zhiming Ma, Zun Wang, Yuting Liu

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to…

Computation and Language · Computer Science 2024-07-11 Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the…

Machine Learning · Computer Science 2026-01-28 Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO),…

Machine Learning · Computer Science 2025-10-13 Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging…

Machine Learning · Computer Science 2026-05-26 Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong…

Computation and Language · Computer Science 2025-11-10 Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen

Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to…

Machine Learning · Computer Science 2026-02-24 Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen

EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business…

Computation and Language · Computer Science 2025-05-29 Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang

Can GRPO Help LLMs Transcend Their Pretraining Origin?

Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its…

Machine Learning · Computer Science 2025-10-21 Kangqi Ni, Zhen Tan, Zijie Liu, Pingzhi Li, Tianlong Chen

Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling

The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT)…

Machine Learning · Computer Science 2025-10-13 Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang

Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of…

Artificial Intelligence · Computer Science 2026-04-21 Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao

Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy…

Machine Learning · Computer Science 2026-01-28 Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu

EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy…

Computation and Language · Computer Science 2026-04-14 Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning…

Machine Learning · Computer Science 2026-02-12 Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this…

Machine Learning · Computer Science 2025-10-14 Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash

Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on…

Machine Learning · Computer Science 2025-07-18 Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood…

Computation and Language · Computer Science 2025-12-04 Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, Chongxuan Li