Related papers: Preference Packing: Efficient Preference Optimizat…

Plug-and-Play Training Framework for Preference Optimization

Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) across a wide range of tasks, including dialogue and question-answering. However, current methods fail to account for the varying difficulty…

Computation and Language · Computer Science 2024-12-31 Jingyuan Ma, Rui Li, Zheng Li, Lei Sha, Zhifang Sui

Multi-Response Preference Optimization with Augmented Ranking Dataset

Recent advancements in Large Language Models (LLMs) have been remarkable, with new models consistently surpassing their predecessors. These advancements are underpinned by extensive research on various training mechanisms. Among these,…

Computation and Language · Computer Science 2024-12-12 Hansle Gwon, Imjin Ahn, Young-Hak Kim, Sanghyun Park, Tae Joon Jun

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant…

Artificial Intelligence · Computer Science 2024-10-23 Pietro Bernardelle, Gianluca Demartini

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in…

Computation and Language · Computer Science 2026-04-20 Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman, Inderjit S. Dhillon

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful…

Computation and Language · Computer Science 2024-07-26 Tianduo Wang, Shichen Li, Wei Lu

Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning

Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. In particular, clear quality differences…

Machine Learning · Computer Science 2025-01-28 Nirav Diwan, Tolga Ergen, Dongsub Shim, Honglak Lee

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of…

Computation and Language · Computer Science 2025-01-24 Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training,…

Computation and Language · Computer Science 2026-01-01 Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

Accelerating Direct Preference Optimization with Prefix Sharing

Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve…

Machine Learning · Computer Science 2024-11-01 Franklin Wang, Sumanth Hegde

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the…

Computation and Language · Computer Science 2025-01-09 Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover

Multi-Reference Preference Optimization for Large Language Models

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

Active Preference Learning for Large Language Models

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model…

Machine Learning · Computer Science 2024-07-01 William Muldrew, Peter Hayes, Mingtian Zhang, David Barber

Preference Alignment Improves Language Model-Based TTS

Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust…

Computation and Language · Computer Science 2024-09-20 Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data.…

Machine Learning · Computer Science 2026-03-10 Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within…

Computation and Language · Computer Science 2025-07-11 Zhijin Dong

What Is Preference Optimization Doing, and Why?

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is…

Machine Learning · Computer Science 2026-05-18 Yue Wang, Qizhou Wang, Zizhuo Zhang, Gang Niu, Bo Han, Masashi Sugiyama

A Survey of Direct Preference Optimization

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Machine Learning · Computer Science 2025-03-18 Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao

When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted…

Computation and Language · Computer Science 2026-03-03 Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche

PORT: Preference Optimization on Reasoning Traces

Preference optimization methods have been successfully applied to improve not only the alignment of large language models (LLMs) with human values, but also specific natural language tasks such as summarization and stylistic continuations.…

Machine Learning · Computer Science 2025-02-06 Salem Lahlou, Abdalgader Abubaker, Hakim Hacid