English
Related papers

Related papers: What Is Preference Optimization Doing, and Why?

200 papers

Direct Preference Optimization (DPO) has emerged as a simple and effective approach for aligning large language models (LLMs) with human preferences, bypassing the need for a learned reward model. Despite its growing adoption, a fundamental…

Machine Learning · Computer Science 2025-11-10 Yu Pan , Zhongze Cai , Guanting Chen , Huaiyang Zhong , Chonghuan Wang

Aligning large language models (LLMs) with human preferences has gained significant attention, with Proximal Policy Optimization (PPO) as a standard yet computationally expensive method and Direct Preference Optimization (DPO) as a more…

Artificial Intelligence · Computer Science 2025-02-10 Yuzi Yan , Yibo Miao , Jialian Li , Yipin Zhang , Jian Xie , Zhijie Deng , Dong Yan

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference…

Computation and Language · Computer Science 2024-08-23 Yixin Liu , Pengfei Liu , Arman Cohan

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces…

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin , Zhendong Wang , Yi Gu , Hai Huang , Weizhu Chen , Mingyuan Zhou

In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback…

Information Retrieval · Computer Science 2026-05-04 Xingyu Hu , Kai Zhang , Jiancan Wu , Shuli Wang , Chi Wang , Wenshuai Chen , Yinhua Zhu , Haitao Wang , Xingxing Wang , Xiang Wang

Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly,…

Computation and Language · Computer Science 2024-10-10 Hamish Ivison , Yizhong Wang , Jiacheng Liu , Zeqiu Wu , Valentina Pyatkin , Nathan Lambert , Noah A. Smith , Yejin Choi , Hannaneh Hajishirzi

Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward…

Computation and Language · Computer Science 2025-07-29 Tong Liu , Xiao Yu , Wenxuan Zhou , Jindong Gu , Volker Tresp

Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across…

Computation and Language · Computer Science 2024-04-09 Duanyu Feng , Bowen Qin , Chen Huang , Zheng Zhang , Wenqiang Lei

Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. In particular, clear quality differences…

Machine Learning · Computer Science 2025-01-28 Nirav Diwan , Tolga Ergen , Dongsub Shim , Honglak Lee

Direct Preference Optimization (DPO) have emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun , Chuang Liu , Peng Liu , Bingyang Li , Weijia Lu , Ning Wu

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen , Fang Liu , Jennifer Zhu , Wanyu Du , Yanjun Qi

The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using…

Computation and Language · Computer Science 2025-10-30 Jie Sun , Junkang Wu , Jiancan Wu , Zhibo Zhu , Xingyu Lu , Jun Zhou , Lintao Ma , Xiang Wang

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li , Haojing Huang , Yujia Zhang , Pengfei Xu , Xi Chen , Rui Song , Lida Shi , Jingwen Wang , Hao Xu

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science 2025-10-27 Weibin Liao , Xu Chu , Yasha Wang

As Large Language Models (LLMs) demonstrate remarkable capabilities learned from vast corpora, concerns regarding data privacy and safety are receiving increasing attention. LLM unlearning, which aims to remove the influence of specific…

Machine Learning · Computer Science 2025-10-07 Kai Qin , Jiaqi Wu , Jianxiang He , Haoyuan Sun , Yifei Zhao , Bin Liang , Yongzhe Chang , Tiantian Zhang , Houde Liu

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant…

Artificial Intelligence · Computer Science 2024-10-23 Pietro Bernardelle , Gianluca Demartini

Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While…

Machine Learning · Computer Science 2025-09-09 Thanh Thi Nguyen , Campbell Wilson , Janis Dalins

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao , Zechuan Wang , Leilei Gan , Shuai Zhao , Zongrui Li , Ruirui Lei , Wanggui He , Luu Anh Tuan , Long Chen , Hao Jiang , Zhou Zhao , Fei Wu
‹ Prev 1 2 3 10 Next ›