Related papers: Exploratory Preference Optimization: Harnessing Im…

200 papers

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving sample efficiency. All existing algorithms…

Machine Learning · Computer Science 2025-09-29 Mingyu Chen , Yiding Chen , Wen Sun , Xuezhou Zhang

Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However,…

Machine Learning · Computer Science 2024-12-17 Avinandan Bose , Zhihan Xiong , Aadirupa Saha , Simon Shaolei Du , Maryam Fazel

Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of…

Machine Learning · Computer Science 2025-03-04 Branislav Kveton , Xintong Li , Julian McAuley , Ryan Rossi , Jingbo Shang , Junda Wu , Tong Yu

The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a…

Machine Learning · Computer Science 2025-06-10 Xiangkun Hu , Lemin Kong , Tong He , David Wipf

Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in this…

Machine Learning · Computer Science 2026-05-07 Zhen-Yu Zhang , Yuting Tang , Jiandong Zhang , Lanjihong Ma , Masashi Sugiyama

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather…

Artificial Intelligence · Computer Science 2026-05-21 Zhiqin Yang , Yonggang Zhang , Wei Xue , Dong Fang , Bo Han , Yike Guo

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov , Archit Sharma , Eric Mitchell , Stefano Ermon , Christopher D. Manning , Chelsea Finn

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in…

Machine Learning · Computer Science 2024-05-02 Wei Xiong , Hanze Dong , Chenlu Ye , Ziqi Wang , Han Zhong , Heng Ji , Nan Jiang , Tong Zhang

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when…

Computation and Language · Computer Science 2024-07-15 Xiangkun Hu , Tong He , David Wipf

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model…

Machine Learning · Computer Science 2024-07-01 William Muldrew , Peter Hayes , Mingtian Zhang , David Barber

Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm to optimize preference learning, which first fits a reward model for…

Machine Learning · Computer Science 2024-03-26 Zaifan Jiang , Xing Huang , Chao Wei

Online and offline RLHF methods, such as PPO and DPO, have been highly successful in aligning AI with human preferences. Despite their success, however, these methods suffer from fundamental limitations: (a) Models trained with RLHF can…

Machine Learning · Computer Science 2025-04-15 Eugene Choi , Arash Ahmadian , Matthieu Geist , Olivier Pietquin , Mohammad Gheshlaghi Azar

Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be…

Machine Learning · Computer Science 2025-02-10 Chenjia Bai , Yang Zhang , Shuang Qiu , Qiaosheng Zhang , Kang Xu , Xuelong Li

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao , Zechuan Wang , Leilei Gan , Shuai Zhao , Zongrui Li , Ruirui Lei , Wanggui He , Luu Anh Tuan , Long Chen , Hao Jiang , Zhou Zhao , Fei Wu

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou , Cam-Tu Nguyen

Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as…

Machine Learning · Computer Science 2024-08-14 Rafael Rafailov , Joey Hejna , Ryan Park , Chelsea Finn

Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas…

Machine Learning · Computer Science 2025-02-20 Shicong Cen , Jincheng Mei , Katayoon Goshvadi , Hanjun Dai , Tong Yang , Sherry Yang , Dale Schuurmans , Yuejie Chi , Bo Dai

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a…

Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the…

Machine Learning · Computer Science 2024-12-04 Tetsuro Morimura , Mitsuki Sakamoto , Yuu Jinnai , Kenshi Abe , Kaito Ariu