Related papers: Understanding Likelihood Over-optimisation in Dire…

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values.…

Machine Learning · Computer Science 2025-06-12 Phuc Minh Nguyen, Ngoc-Hieu Nguyen, Duy H. M. Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant…

Artificial Intelligence · Computer Science 2024-10-23 Pietro Bernardelle, Gianluca Demartini

Improving Safety Alignment via Balanced Direct Preference Optimization

With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the…

Artificial Intelligence · Computer Science 2026-03-25 Shiji Zhao, Mengyang Wang, Shukun Xiong, Fangzhou Chen, Qihui Zhu, Shouwei Ruan, Yisong Xiao, Ranjie Duan, Xun Chen, XingXing Wei

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of…

Artificial Intelligence · Computer Science 2026-01-13 Tarun Raheja, Nilay Pochhi

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather…

Artificial Intelligence · Computer Science 2026-05-21 Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

Less is More: Improving LLM Alignment via Preference Data Selection

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from…

Machine Learning · Computer Science 2026-02-17 Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

A Survey of Direct Preference Optimization

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Machine Learning · Computer Science 2025-03-18 Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across…

Computation and Language · Computer Science 2024-04-09 Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, Wenqiang Lei

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of…

Artificial Intelligence · Computer Science 2024-06-11 Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs), however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained…

Machine Learning · Computer Science 2024-11-06 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

Listwise Direct Preference Optimization with Multi-Dimensional Preference Mixing

Recent alignment methods based on Direct Preference Optimization (DPO) reformulate preference learning as supervised optimization over pairwise comparisons, offering improved efficiency and stability over reinforcement learning from human…

Machine Learning · Computer Science 2026-01-22 Yuhui Sun, Xiyao Wang, Zixi Li, YiTian Ding, Tianyang Ling, Jialuo Chen, Tianyi Yu, Zhenlong Yuan, Jinman Zhao

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training,…

Computation and Language · Computer Science 2026-01-01 Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization

Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness.…

Computation and Language · Computer Science 2025-12-01 Jian Li, Shenglin Yin, Yujia Zhang, Alan Zhao, Xi Chen, Xiaohui Zhou, Pengfei Xu

AlphaPO: Reward Shape Matters for LLM Alignment

Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment…

Computation and Language · Computer Science 2025-06-02 Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh Pillai, S. Sathiya Keerthi

A Statistical Framework for Alignment with Biased AI Feedback

Modern alignment pipelines are increasingly replacing expensive human preference labels with evaluations from large language models (LLM-as-Judge). However, AI labels can be systematically biased compared to high-quality human feedback…

Machine Learning · Statistics 2026-02-10 Xintao Xia, Zhiqiu Xia, Linjun Zhang, Zhanrui Cai

DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations

Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty…

Artificial Intelligence · Computer Science 2026-01-05 Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He