Related papers: Direct Preference Optimization with an Offset

Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning

Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. In particular, clear quality differences…

Machine Learning · Computer Science 2025-01-28 Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee

What Matters in Data for DPO?

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental…

Machine Learning · Computer Science 2025-11-10 Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across…

Computation and Language · Computer Science 2024-04-09 Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, Wenqiang Lei

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Robust Preference Optimization through Reward Model Distillation

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on…

Machine Learning · Computer Science 2025-03-04 Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Understanding Reference Policies in Direct Preference Optimization

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference…

Computation and Language · Computer Science 2024-08-23 Yixin Liu, Pengfei Liu, Arman Cohan

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference…

Computer Vision and Pattern Recognition · Computer Science 2026-02-09 Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih

Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses…

Machine Learning · Computer Science 2025-06-23 Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan

Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However,…

Computation and Language · Computer Science 2025-05-27 Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng

Active Learning for Direct Preference Optimization

Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of…

Machine Learning · Computer Science 2025-03-04 Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu

Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm

Direct Preference Optimisation (DPO) has emerged as a powerful method for aligning Large Language Models (LLMs) with human preferences, offering a stable and efficient alternative to approaches that use Reinforcement learning via Human…

Artificial Intelligence · Computer Science 2025-05-06 Sarvesh Shashidhar, Ritik, Nachiketa Patil, Suraj Racha, Ganesh Ramakrishnan

VPO: Leveraging the Number of Votes in Preference Optimization

Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference…

Machine Learning · Computer Science 2024-10-31 Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee

DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected…

Computation and Language · Computer Science 2025-06-09 Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

Lightweight Robust Direct Preference Optimization

Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting.…

Machine Learning · Computer Science 2025-10-28 Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

Multi-Reference Preference Optimization for Large Language Models

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter…

Artificial Intelligence · Computer Science 2024-10-15 Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over…

Artificial Intelligence · Computer Science 2026-04-23 Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu

New Desiderata for Direct Preference Optimization

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when…

Computation and Language · Computer Science 2024-07-15 Xiangkun Hu, Tong He, David Wipf