Related papers: Dual Approximation Policy Optimization

200 papers

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially…

Machine Learning · Statistics 2024-02-14 Carlo Alfano, Rui Yuan, Patrick Rebeschini

This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample…

Machine Learning · Computer Science 2025-08-20 Hongze Tan, Yuchen Li

Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the…

Machine Learning · Computer Science 2025-10-28 Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon

Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and…

Machine Learning · Computer Science 2026-02-05 Rui Yuan, Mykola Khandoga, Vinay Kumar Sankarapu

We introduce a novel alignment method for diffusion models from distribution optimization perspectives while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over…

Machine Learning · Computer Science 2025-03-07 Ryotaro Kawata, Kazusato Oko, Atsushi Nitanda, Taiji Suzuki

Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable…

Machine Learning · Computer Science 2021-06-08 Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh

Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can…

Machine Learning · Computer Science 2025-07-08 Uri Sherman, Tomer Koren, Yishay Mansour

Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating…

Machine Learning · Computer Science 2025-10-01 Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback…

Machine Learning · Statistics 2026-02-10 Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policies for each state. This…

Machine Learning · Computer Science 2026-03-24 Caleb Ju, Guanghui Lan

Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the…

Machine Learning · Computer Science 2023-03-01 Boyi Liu, Qi Cai, Zhuoran Yang, Zhaoran Wang

We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$…

Machine Learning · Computer Science 2022-02-07 Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

The recent remarkable progress of deep reinforcement learning (DRL) stands on regularization of policy for stable and efficient learning. A popular method, named proximal policy optimization (PPO), has been introduced for this purpose. PPO…

Machine Learning · Computer Science 2023-07-04 Taisuke Kobayashi

Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that…

Computation and Language · Computer Science 2025-02-18 Jiaqi Han, Mingjian Jiang, Yuxuan Song, Stefano Ermon, Minkai Xu

Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations,…

Machine Learning · Computer Science 2022-10-14 Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science 2026-05-27 Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive…

Machine Learning · Computer Science 2026-02-09 Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human…

Artificial Intelligence · Computer Science 2025-05-06 Sarvesh Shashidhar, Ritik, Nachiketa Patil, Suraj Racha, Ganesh Ramakrishnan

Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently, Vaswani et al. [2021] introduced a policy gradient method that corresponds to mirror ascent…

Machine Learning · Computer Science 2025-06-02 Reza Asad, Reza Babanezhad, Issam Laradji, Nicolas Le Roux, Sharan Vaswani