Related papers: Agentic Entropy-Balanced Policy Optimization

200 papers

Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can…

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited…

Machine Learning · Computer Science 2026-04-15 Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou

Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves…

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data…

Machine Learning · Computer Science 2025-11-13 Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical…

Machine Learning · Computer Science 2026-02-11 Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this…

Artificial Intelligence · Computer Science 2025-10-08 Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large…

Computation and Language · Computer Science 2026-04-21 Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential in large language model (LLM)-based agents. These works can empower LLM agents to tackle complex tasks via multi-step, tool-integrated reasoning. However, an…

Artificial Intelligence · Computer Science 2026-03-04 Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for…

Machine Learning · Computer Science 2024-06-07 Muning Wen, Junwei Liao, Cheng Deng, Jun Wang, Weinan Zhang, Ying Wen

Large Language Model (LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle…

Computation and Language · Computer Science 2026-01-08 Xinmiao Yu, Liwen Zhang, Xiaocheng Feng, Yong Jiang, Bing Qin, Pengjun Xie, Jingren Zhou

The policy represented by a deep neural network can overfit to spurious features in observations, which hampers a reinforcement learning agent from learning an effective policy. This issue becomes severe in high-dimensional state, where the…

Machine Learning · Computer Science 2023-05-01 Md Masudur Rahman, Yexiang Xue

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning,…

Artificial Intelligence · Computer Science 2026-05-29 Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical…

Artificial Intelligence · Computer Science 2026-01-09 Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang

The policy gradient method enjoys the simplicity of the objective where the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, parameterized distribution of action distribution allows easy control of…

Machine Learning · Computer Science 2022-12-16 Md Masudur Rahman, Yexiang Xue

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in…

Artificial Intelligence · Computer Science 2026-02-09 Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

As single-center computing approaches power constraints, decentralized training becomes essential. However, traditional Reinforcement Learning (RL) methods, crucial for enhancing large model post-training, cannot adapt to decentralized…

Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains…

Machine Learning · Computer Science 2024-05-27 Akhil Agnihotri, Rahul Jain, Haipeng Luo

Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards…

Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance…

Computation and Language · Computer Science 2026-03-20 Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C.…

Machine Learning · Computer Science 2026-02-06 Han Shen