Related papers: Filtered Direct Preference Optimization

200 papers

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when…

Computation and Language · Computer Science 2024-07-15 Xiangkun Hu, Tong He, David Wipf

Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of…

Machine Learning · Computer Science 2025-03-04 Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant…

Artificial Intelligence · Computer Science 2024-10-23 Pietro Bernardelle, Gianluca Demartini

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to…

Machine Learning · Computer Science 2024-03-26 Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li

The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these…

Machine Learning · Computer Science 2024-03-06 Zixuan Liu, Xiaolin Sun, Zizhan Zheng

Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of Large Language Models (LLMs) with human preferences. Recently, a popular alternative is Direct Preference Optimization (DPO), which replaces an…

Computation and Language · Computer Science 2024-06-03 Runsheng Yu, Yong Wang, Xiaoqi Jiao, Youzhi Zhang, James T. Kwok

Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. However, RLHF is known to exploit biases in human preferences, such as verbosity. A well-formatted and eloquent…

Computation and Language · Computer Science 2024-09-10 Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training,…

Computation and Language · Computer Science 2026-01-01 Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Machine Learning · Computer Science 2025-10-21 Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient…

Machine Learning · Computer Science 2026-03-23 Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Artificial Intelligence · Computer Science 2025-10-20 Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou, Cam-Tu Nguyen

Reinforcement Learning from Human Feedback (RLHF) and derivative techniques like Direct Preference Optimization (DPO) are task-alignment algorithms used to repurpose general, foundational models for specific tasks. We show that applying…

Computation and Language · Computer Science 2025-09-30 Kaden Uhlig, Joern Wuebker, Raphael Reinauer, John DeNero

In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct…

Machine Learning · Computer Science 2024-06-06 Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović, Adish Singla

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a…

Machine Learning · Computer Science 2024-10-10 Jiafan He, Huizhuo Yuan, Quanquan Gu

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of…

Computation and Language · Computer Science 2025-01-24 Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free.…

Computation and Language · Computer Science 2024-10-11 Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu