Related papers: Implicit Turn-Wise Policy Optimization for Proacti…

200 papers

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require…

Computation and Language · Computer Science 2026-03-25 Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward…

Artificial Intelligence · Computer Science 2026-03-03 Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches.…

Machine Learning · Computer Science 2026-04-21 Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial…

Computation and Language · Computer Science 2026-01-23 Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han

Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The…

Machine Learning · Computer Science 2025-11-27 Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu

While astonishingly capable, large language models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information.…

Artificial Intelligence · Computer Science 2024-10-30 Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns…

Computation and Language · Computer Science 2025-01-23 Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn…

Machine Learning · Computer Science 2026-01-27 Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks. This paper investigates the feasibility of using PPO for direct reinforcement…

Computation and Language · Computer Science 2024-10-23 Alexander G. Padula, Dennis J. N. J. Soemers

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is…

Computation and Language · Computer Science 2025-06-06 Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when…

Computation and Language · Computer Science 2025-05-20 Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO…

Artificial Intelligence · Computer Science 2026-05-04 Abdulhady Abas Abdullah, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah

Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic…

Machine Learning · Computer Science 2026-01-30 Minjae Cho, Huy Trong Tran

Large Language Models (LLMs) can acquire extensive world knowledge through pre-training on large corpora. However, due to exposure to low-quality data, LLMs may exhibit harmful behavior without aligning with human values. The dominant…

Machine Learning · Computer Science 2023-10-11 Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, Jiantao Jiao

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions.…

Computation and Language · Computer Science 2026-05-08 Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions…

Machine Learning · Computer Science 2026-02-03 Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across…

Machine Learning · Computer Science 2026-05-26 Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading Policy Gradient method, its performance can degrade under delayed rewards.…

With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical…

Machine Learning · Computer Science 2025-09-22 Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to…

Machine Learning · Computer Science 2025-10-14 Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock