Related papers: New Desiderata for Direct Preference Optimization

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov , Archit Sharma , Eric Mitchell , Stefano Ermon , Christopher D. Manning , Chelsea Finn

Explicit Preference Optimization: No Need for an Implicit Reward Model

The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a…

Machine Learning · Computer Science 2025-06-10 Xiangkun Hu , Lemin Kong , Tong He , David Wipf

A Survey of Direct Preference Optimization

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Machine Learning · Computer Science 2025-03-18 Shunyu Liu , Wenkai Fang , Zetian Hu , Junjie Zhang , Yang Zhou , Kongcheng Zhang , Rongcheng Tu , Ting-En Lin , Fei Huang , Mingli Song , Yongbin Li , Dacheng Tao

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao , Zechuan Wang , Leilei Gan , Shuai Zhao , Zongrui Li , Ruirui Lei , Wanggui He , Luu Anh Tuan , Long Chen , Hao Jiang , Zhou Zhao , Fei Wu

Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Machine Learning · Computer Science 2025-10-21 Keertana Chidambaram , Karthik Vinay Seetharaman , Vasilis Syrgkanis

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Artificial Intelligence · Computer Science 2025-10-20 Keertana Chidambaram , Karthik Vinay Seetharaman , Vasilis Syrgkanis

Filtered Direct Preference Optimization

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the…

Machine Learning · Computer Science 2024-12-04 Tetsuro Morimura , Mitsuki Sakamoto , Yuu Jinnai , Kenshi Abe , Kaito Ariu

Active Learning for Direct Preference Optimization

Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of…

Machine Learning · Computer Science 2025-03-04 Branislav Kveton , Xintong Li , Julian McAuley , Ryan Rossi , Jingbo Shang , Junda Wu , Tong Yu

Direct Alignment of Language Models via Quality-Aware Self-Refinement

Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Policy Optimization (DPO), which replaces an…

Computation and Language · Computer Science 2024-06-03 Runsheng Yu , Yong Wang , Xiaoqi Jiao , Youzhi Zhang , James T. Kwok

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant…

Artificial Intelligence · Computer Science 2024-10-23 Pietro Bernardelle , Gianluca Demartini

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to…

Machine Learning · Computer Science 2024-03-26 Kai Yang , Jian Tao , Jiafei Lyu , Chunjiang Ge , Jiaxin Chen , Qimai Li , Weihan Shen , Xiaolong Zhu , Xiu Li

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct…

Machine Learning · Computer Science 2024-06-06 Andi Nika , Debmalya Mandal , Parameswaran Kamalaruban , Georgios Tzannetos , Goran Radanović , Adish Singla

Accelerated Preference Optimization for Large Language Model Alignment

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a…

Machine Learning · Computer Science 2024-10-10 Jiafan He , Huizhuo Yuan , Quanquan Gu

Multi-Reference Preference Optimization for Large Language Models

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le , Quan Tran , Dung Nguyen , Kien Do , Saloni Mittal , Kelechi Ogueji , Svetha Venkatesh

Active Preference Learning for Large Language Models

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model…

Machine Learning · Computer Science 2024-07-01 William Muldrew , Peter Hayes , Mingtian Zhang , David Barber

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable requiring significant…

Computation and Language · Computer Science 2024-04-02 Saeed Khaki , JinJin Li , Lan Ma , Liu Yang , Prathap Ramachandra

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a…

Machine Learning · Computer Science 2024-10-04 Yong Lin , Skyler Seto , Maartje ter Hoeve , Katherine Metcalf , Barry-John Theobald , Xuan Wang , Yizhe Zhang , Chen Huang , Tong Zhang

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou , Cam-Tu Nguyen

A General Theoretical Paradigm to Understand Learning from Human Preferences

The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second…

Artificial Intelligence · Computer Science 2023-11-23 Mohammad Gheshlaghi Azar , Mark Rowland , Bilal Piot , Daniel Guo , Daniele Calandriello , Michal Valko , Rémi Munos

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training,…

Computation and Language · Computer Science 2026-01-01 Junshu Pan , Wei Shen , Shulin Huang , Qiji Zhou , Yue Zhang