Related papers: Agentic Entropy-Balanced Policy Optimization

Agentic Reinforced Policy Optimization

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can…

Machine Learning · Computer Science 2025-07-29 Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited…

Machine Learning · Computer Science 2026-04-15 Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves…

Machine Learning · Computer Science 2025-10-23 Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang

Evolutionary Policy Optimization

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data…

Machine Learning · Computer Science 2025-11-13 Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical…

Machine Learning · Computer Science 2026-02-11 Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this…

Artificial Intelligence · Computer Science 2025-10-08 Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large…

Computation and Language · Computer Science 2026-04-21 Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model-based (LLM) agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an…

Artificial Intelligence · Computer Science 2026-03-04 Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement

Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for…

Machine Learning · Computer Science 2024-06-07 Muning Wen, Junwei Liao, Cheng Deng, Jun Wang, Weinan Zhang, Ying Wen

WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle…

Computation and Language · Computer Science 2026-01-08 Xinmiao Yu, Liwen Zhang, Xiaocheng Feng, Yong Jiang, Bing Qin, Pengjun Xie, Jingren Zhou

Adversarial Policy Optimization in Deep Reinforcement Learning

The policy represented by the deep neural network can overfit the spurious features in observations, which hamper a reinforcement learning agent from learning effective policy. This issue becomes severe in high-dimensional state, where the…

Machine Learning · Computer Science 2023-05-01 Md Masudur Rahman, Yexiang Xue

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning,…

Artificial Intelligence · Computer Science 2026-05-29 Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical…

Artificial Intelligence · Computer Science 2026-01-09 Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang

Robust Policy Optimization in Deep Reinforcement Learning

The policy gradient method enjoys the simplicity of the objective where the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, parameterized distribution of action distribution allows easy control of…

Machine Learning · Computer Science 2022-12-16 Md Masudur Rahman, Yexiang Xue

SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in…

Artificial Intelligence · Computer Science 2026-02-09 Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized…

Machine Learning · Computer Science 2026-01-30 Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu

ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints

Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains…

Machine Learning · Computer Science 2024-05-27 Akhil Agnihotri, Rahul Jain, Haipeng Luo

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards…

Artificial Intelligence · Computer Science 2026-05-11 Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu

DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance…

Computation and Language · Computer Science 2026-03-20 Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

On Entropy Control in LLM-RL Algorithms

For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C.…

Machine Learning · Computer Science 2026-02-06 Han Shen