Related papers: Accelerating Direct Preference Optimization with P…

Direct Preference Optimization with an Offset

Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated,…

Computation and Language · Computer Science 2024-06-07 Afra Amini, Tim Vieira, Ryan Cotterell

Preference Packing: Efficient Preference Optimization for Large Language Models

Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning…

Computation and Language · Computer Science 2026-03-02 Jaekyung Cho

Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning

Direct Preference Optimization (DPO) has emerged as a de facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. In particular, clear quality differences…

Machine Learning · Computer Science 2025-01-28 Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee

Preference-Based Alignment of Discrete Diffusion Models

Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains…

Machine Learning · Computer Science 2025-04-10 Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel

Prefix Propagation: Parameter-Efficient Tuning for Long Sequences

Parameter-efficient tuning aims to mitigate the large memory requirements of adapting pretrained language models for downstream tasks. For example, one popular method, prefix-tuning, prepends trainable tokens to sequences while freezing the…

Computation and Language · Computer Science 2023-05-26 Jonathan Li, Will Aitken, Rohan Bhambhoria, Xiaodan Zhu

What Matters in Data for DPO?

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental…

Machine Learning · Computer Science 2025-11-10 Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is…

Computer Vision and Pattern Recognition · Computer Science 2026-05-21 Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

Distributed Direct Preference Optimization

Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly…

Machine Learning · Computer Science 2026-05-21 Zhanhong Jiang

Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of…

Artificial Intelligence · Computer Science 2024-06-11 Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

g-DPO: Scalable Preference Optimization for Protein Language Models

Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with…

Machine Learning · Computer Science 2025-11-27 Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team

Lightweight Robust Direct Preference Optimization

Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting.…

Machine Learning · Computer Science 2025-10-28 Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward…

Computation and Language · Computer Science 2025-07-29 Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training…

Machine Learning · Computer Science 2026-05-11 Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLM). A weakness of DPO, however,…

Machine Learning · Computer Science 2025-04-21 Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang

DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected…

Computation and Language · Computer Science 2025-06-09 Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li

Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Dohyun Kim, Seungwoo Lyu, Seung Wook Kim, Paul Hongsuck Seo

Diffusion Model Alignment Using Direct Preference Optimization

Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has…

Computer Vision and Pattern Recognition · Computer Science 2023-11-23 Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models

Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology:…

Computer Vision and Pattern Recognition · Computer Science 2025-12-03 Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang