Related papers: Implicit Turn-Wise Policy Optimization for Proacti…

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require…

Computation and Language · Computer Science 2026-03-25 Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward…

Artificial Intelligence · Computer Science 2026-03-03 Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches.…

Machine Learning · Computer Science 2026-04-21 Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial…

Computation and Language · Computer Science 2026-01-23 Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han

Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The…

Machine Learning · Computer Science 2025-11-27 Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu

$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization

While astonishingly capable, large language models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information.…

Artificial Intelligence · Computer Science 2024-10-30 Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns…

Computation and Language · Computer Science 2025-01-23 Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn…

Machine Learning · Computer Science 2026-01-27 Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks. This paper investigates the feasibility of using PPO for direct reinforcement…

Computation and Language · Computer Science 2024-10-23 Alexander G. Padula, Dennis J. N. J. Soemers

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is…

Computation and Language · Computer Science 2025-06-06 Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when…

Computation and Language · Computer Science 2025-05-20 Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO…

Artificial Intelligence · Computer Science 2026-05-04 Abdulhady Abas Abdullah, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah

Intrinsic Reward Policy Optimization for Sparse-Reward Environments

Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic…

Machine Learning · Computer Science 2026-01-30 Minjae Cho, Huy Trong Tran

Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment

Large Language Models (LLMs) can acquire extensive world knowledge through pre-training on large corpora. However, due to exposure to low-quality data, LLMs may exhibit harmful behavior without aligning with human values. The dominant…

Machine Learning · Computer Science 2023-10-11 Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, Jiantao Jiao

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions.…

Computation and Language · Computer Science 2026-05-08 Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

Agentic Policy Optimization via Instruction-Policy Co-Evolution

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions…

Machine Learning · Computer Science 2026-02-03 Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across…

Machine Learning · Computer Science 2026-05-26 Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards

In this paper, we tackle the challenging problem of delayed rewards in reinforcement learning (RL). While Proximal Policy Optimization (PPO) has emerged as a leading Policy Gradient method, its performance can degrade under delayed rewards.…

Machine Learning · Computer Science 2024-12-06 Ahmad Ahmad, Mehdi Kermanshah, Kevin Leahy, Zachary Serlin, Ho Chit Siu, Makai Mann, Cristian-Ioan Vasile, Roberto Tron, Calin Belta

TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning

With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical…

Machine Learning · Computer Science 2025-09-22 Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to…

Machine Learning · Computer Science 2025-10-14 Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock