Related papers: Dual Approximation Policy Optimization

A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially…

Machine Learning · Statistics 2024-02-14 Carlo Alfano, Rui Yuan, Patrick Rebeschini

Improving DAPO from a Mixed-Policy Perspective

This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample…

Machine Learning · Computer Science 2025-08-20 Hongze Tan, Yuchen Li

Preference Optimization by Estimating the Ratio of the Data Distribution

Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the…

Machine Learning · Computer Science 2025-10-28 Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon

Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and…

Machine Learning · Computer Science 2026-02-05 Rui Yuan, Mykola Khandoga, Vinay Kumar Sankarapu

Direct Distributional Optimization for Provable Alignment of Diffusion Models

We introduce a novel alignment method for diffusion models from distribution optimization perspectives while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over…

Machine Learning · Computer Science 2025-03-07 Ryotaro Kawata, Kazusato Oko, Atsushi Nitanda, Taiji Suzuki

Mirror Descent Policy Optimization

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable…

Machine Learning · Computer Science 2021-06-08 Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh

Convergence of Policy Mirror Descent Beyond Compatible Function Approximation

Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can…

Machine Learning · Computer Science 2025-07-08 Uri Sherman, Tomer Koren, Yishay Mansour

Dual Alignment Maximin Optimization for Offline Model-based RL

Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating…

Machine Learning · Computer Science 2025-10-01 Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang

A Statistical Framework for Alignment with Biased AI Feedback

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback…

Machine Learning · Statistics 2026-02-10 Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

Policy Optimization over General State and Action Spaces

Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tableau setting, one can not enumerate all the states and then iteratively update the policies for each state. This…

Machine Learning · Computer Science 2026-03-24 Caleb Ju, Guanghui Lan

Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the…

Machine Learning · Computer Science 2023-03-01 Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang

Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$…

Machine Learning · Computer Science 2022-02-07 Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

Proximal Policy Optimization with Relative Pearson Divergence

The recent remarkable progress of deep reinforcement learning (DRL) stands on regularization of policy for stable and efficient learning. A popular method, named proximal policy optimization (PPO), has been introduced for this purpose. PPO…

Machine Learning · Computer Science 2023-07-04 Taisuke Kobayashi

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that…

Computation and Language · Computer Science 2025-02-18 Jiaqi Han, Mingjian Jiang, Yuxuan Song, Stefano Ermon, Minkai Xu

Discovered Policy Optimisation

Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations,…

Machine Learning · Computer Science 2022-10-14 Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science 2026-05-27 Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Reparameterization Proximal Policy Optimization

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive…

Machine Learning · Computer Science 2026-02-09 Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm

Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human…

Artificial Intelligence · Computer Science 2025-05-06 Sarvesh Shashidhar, Ritik, Nachiketa Patil, Suraj Racha, Ganesh Ramakrishnan

Fast Convergence of Softmax Policy Mirror Ascent

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent…

Machine Learning · Computer Science 2025-06-02 Reza Asad, Reza Babanezhad, Issam Laradji, Nicolas Le Roux, Sharan Vaswani