Related papers: Diverse Preference Optimization

VPO: Leveraging the Number of Votes in Preference Optimization

Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference…

Machine Learning · Computer Science 2024-10-31 Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee

Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses…

Machine Learning · Computer Science 2025-06-23 Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan

Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning

Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. In particular, clear quality differences…

Machine Learning · Computer Science 2025-01-28 Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee

Direct Preference Optimization with an Offset

Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated,…

Computation and Language · Computer Science 2024-06-07 Afra Amini, Tim Vieira, Ryan Cotterell

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

Dual Caption Preference Optimization for Diffusion Models

Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Amir Saeidi, Yiran Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected…

Computation and Language · Computer Science 2025-06-09 Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

Modifying Large Language Model Post-Training for Diverse Creative Writing

As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation…

Computation and Language · Computer Science 2025-03-24 John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski

CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences

Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have…

Computation and Language · Computer Science 2025-11-12 Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Preference Optimization with Multi-Sample Comparisons

Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as…

Machine Learning · Computer Science 2025-03-27 Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, Xuefei Cao, Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang

What Matters in Data for DPO?

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental…

Machine Learning · Computer Science 2025-11-10 Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation

Direct Preference Optimization (DPO) has shown promising results in aligning generative outputs with human preferences by distinguishing between chosen and rejected samples. However, a critical limitation of DPO is likelihood displacement,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Tianxiang Zheng, Qinhlin Lu

Robust Preference Optimization through Reward Model Distillation

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on…

Machine Learning · Computer Science 2025-03-04 Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Rethinking DPO: The Role of Rejected Responses in Preference Misalignment

Direct Preference Optimization (DPO) is a simple and efficient framework that has attracted substantial attention. However, it often struggles to meet its primary objectives -- increasing the generation probability of chosen responses while…

Artificial Intelligence · Computer Science 2025-06-17 Jay Hyeon Cho, JunHyeok Oh, Myunsoo Kim, Byung-Jun Lee

GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets

A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable among such methods is Reinforcement Learning…

Artificial Intelligence · Computer Science 2024-10-22 Oh Joon Kwon, Daiki E. Matsunaga, Kee-Eung Kim

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair…

Computation and Language · Computer Science 2026-04-13 Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across…

Computation and Language · Computer Science 2024-04-09 Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, Wenqiang Lei

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing…

Machine Learning · Computer Science 2025-10-08 Hyung Gyu Rho