Related papers: APO: Alpha-Divergence Preference Optimization

ADPO: Anchored Direct Preference Optimization

We present Anchored Direct Preference Optimization (ADPO), a policy alignment method derived from first principles of KL-regularized reinforcement learning. Unlike standard approaches that treat the reference policy merely as a regularizer,…

Machine Learning · Computer Science 2026-01-13 Wang Zixian

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose…

Machine Learning · Computer Science 2025-08-26 Rui Wang , Qianguo Sun , Chao Song , Junlong Wu , Tianrong Chen , Zhiyun Zeng , Yu Li

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces…

Machine Learning · Computer Science 2025-07-22 Junkang Wu , Xue Wang , Zhengyi Yang , Jiancan Wu , Jinyang Gao , Bolin Ding , Xiang Wang , Xiangnan He

AM-PPO: (Advantage) Alpha-Modulation with Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm that heavily relies on accurate advantage estimates for stable and efficient training. However, raw advantage signals can exhibit significant variance,…

Machine Learning · Computer Science 2025-05-22 Soham Sane

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that…

Computation and Language · Computer Science 2025-02-18 Jiaqi Han , Mingjian Jiang , Yuxuan Song , Stefano Ermon , Minkai Xu

Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become…

Machine Learning · Computer Science 2026-05-18 Youngjae Cho , Jongsuk Kim , Ji-Hoon Kim

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However,…

Computation and Language · Computer Science 2024-06-04 Pengyu Cheng , Yifan Yang , Jian Li , Yong Dai , Tianhao Hu , Peixin Cao , Nan Du , Xiaolong Li

Adaptive Preference Optimization with Uncertainty-aware Utility Anchor

Offline preference optimization methods are efficient for large language models (LLMs) alignment. Direct Preference optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling.…

Machine Learning · Computer Science 2026-05-26 Xiaobo Wang , Zixia Jia , Jiaqi Li , Qi Liu , Zilong Zheng

A Practical Analysis of Human Alignment with *PO

At the forefront of state-of-the-art human alignment methods are preference optimization methods (*PO). Prior research has often concentrated on identifying the best-performing method, typically involving a grid search over hyperparameters,…

Computation and Language · Computer Science 2025-04-30 Kian Ahrabian , Xihui Lin , Barun Patra , Vishrav Chaudhary , Alon Benhaim , Jay Pujara , Xia Song

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories.…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Jingyuan Zhu , Biaolong Chen , Le Zhang , Aixi Zhang , Hao Jiang , Pipei Huang

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin , Chao Ni , Xiaohu Yang

ANO: A Principled Approach to Robust Policy Optimization

Proximal Policy Optimization (PPO) dominates reinforcement learning and LLM alignment but relies on a "hard clipping" mechanism that discards valuable gradients. Conversely, unconstrained methods like SPO expose the optimization to…

Artificial Intelligence · Computer Science 2026-05-07 Yiheng Zhang , Yiming Wang , Kaiyan Zhao , Zhenglin Wan , Jiayu Chen , Leong Hou U

Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets

Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered. However, constrained reinforcement learning methods face challenges in striking the right…

Machine Learning · Computer Science 2024-10-29 Jianmina Ma , Jingtian Ji , Yue Gao

Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces…

Machine Learning · Computer Science 2025-02-11 Shengyang Sun , Yian Zhang , Alexander Bukharin , David Mosallanezhad , Jiaqi Zeng , Soumye Singhal , Gerald Shen , Adithya Renduchintala , Tugrul Konuk , Yi Dong , Zhilin Wang , Dmitry Chichkov , Olivier Delalleau , Oleksii Kuchaiev

Divergence Minimization Preference Optimization for Diffusion Model Alignment

Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Binxu Li , Minkai Xu , Jiaqi Han , Meihua Dang , Stefano Ermon

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang , Yibo Jiang , Chenghao Yang , Han Liu , Yuxin Chen

Multi-Objective Reward and Preference Optimization: Theory and Algorithms

This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov…

Machine Learning · Computer Science 2025-12-12 Akhil Agnihotri

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two…

Machine Learning · Computer Science 2026-05-19 Xiaomeng Yang , Mengping Yang , Junyan Wang , Zhijian Zhou , Zhiyu Tan , Hao Li

The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization…

Machine Learning · Computer Science 2024-07-17 Yuda Song , Gokul Swamy , Aarti Singh , J. Andrew Bagnell , Wen Sun

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of…

Artificial Intelligence · Computer Science 2024-06-11 Biqing Qi , Pengfei Li , Fangyuan Li , Junqi Gao , Kaiyan Zhang , Bowen Zhou