Related papers: Self-Improving Robust Preference Optimization

200 papers

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science 2025-02-19 Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm to optimize preference learning, which first fits a reward model for…

Machine Learning · Computer Science 2024-03-26 Zaifan Jiang, Xing Huang, Chao Wei

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focuses on improving sample efficiency. All existing algorithms…

Machine Learning · Computer Science 2025-09-29 Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However,…

Machine Learning · Computer Science 2024-12-17 Avinandan Bose, Zhihan Xiong, Aadirupa Saha, Simon Shaolei Du, Maryam Fazel

Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas…

Machine Learning · Computer Science 2025-02-20 Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in…

Machine Learning · Computer Science 2024-05-02 Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when…

Computation and Language · Computer Science 2024-07-15 Xiangkun Hu, Tong He, David Wipf

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different…

Computation and Language · Computer Science 2024-05-31 Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou, Cam-Tu Nguyen

Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps:…

Machine Learning · Computer Science 2024-04-17 Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single…

Machine Learning · Computer Science 2025-10-28 Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the…

Machine Learning · Computer Science 2024-06-03 Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces…

Machine Learning · Computer Science 2025-07-22 Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather…

Artificial Intelligence · Computer Science 2026-05-21 Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning.…

Machine Learning · Computer Science 2025-05-22 Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a…

Machine Learning · Computer Science 2024-10-10 Jiafan He, Huizhuo Yuan, Quanquan Gu

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent…

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requiring significant…

Computation and Language · Computer Science 2024-04-02 Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra