English
Related papers

Related papers: SEAL: Systematic Error Analysis for Value ALignmen…

200 papers

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing…

Artificial Intelligence · Computer Science 2026-05-27 Dongyoon Hahm , Dylan Hadfield-Menell , Kimin Lee

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert , Roberto Calandra

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu , Chengchun Shi , Will Wei Sun

Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety.…

Machine Learning · Computer Science 2026-01-21 Nyal Patel , Matthieu Bou , Arjun Jagota , Satyapriya Krishna , Sonali Parbhoo

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar , Yanming Wan , Hamish Ivison , Abhishek Gupta , Natasha Jaques

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback,…

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference…

Machine Learning · Computer Science 2024-06-25 Mucong Ding , Souradip Chakraborty , Vibhu Agrawal , Zora Che , Alec Koppel , Mengdi Wang , Amrit Bedi , Furong Huang

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as…

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of…

Machine Learning · Computer Science 2025-12-30 Timo Kaufmann , Paul Weng , Viktor Bengs , Eyke Hüllermeier

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model,…

Aligning the behavior of Large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and…

Computation and Language · Computer Science 2025-12-25 Jiayi Zhou , Jiaming Ji , Juntao Dai , Dong Li , Yaodong Yang

Aligning human preference and value is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning…

Artificial Intelligence · Computer Science 2024-10-29 Jiaxiang Li , Siliang Zeng , Hoi-To Wai , Chenliang Li , Alfredo Garcia , Mingyi Hong

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science 2023-10-20 Josef Dai , Xuehai Pan , Ruiyang Sun , Jiaming Ji , Xinbo Xu , Mickel Liu , Yizhou Wang , Yaodong Yang

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong , Zichong Li , Alexander Bukharin , Yixiao Li , Haoming Jiang , Tianbao Yang , Tuo Zhao

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often…

Computation and Language · Computer Science 2024-07-04 Wenhao Liu , Xiaohua Wang , Muling Wu , Tianlong Li , Changze Lv , Zixuan Ling , Jianhao Zhu , Cenyuan Zhang , Xiaoqing Zheng , Xuanjing Huang

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge…

Computation and Language · Computer Science 2025-06-04 Chenghua Huang , Zhizhen Fan , Lu Wang , Fangkai Yang , Pu Zhao , Zeqi Lin , Qingwei Lin , Dongmei Zhang , Saravan Rajmohan , Qi Zhang

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics 2026-02-11 Kai Ye , Hongyi Zhou , Jin Zhu , Francesco Quinzan , Chengchun Shi

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority…

Computation and Language · Computer Science 2025-10-28 Yijiang River Dong , Tiancheng Hu , Yinhong Liu , Ahmet Üstün , Nigel Collier

Reinforcement Learning from Human Feedback (RLHF) is a widely used framework for the training of language models. However, the process of using RLHF to develop a language model that is well-aligned presents challenges, especially when it…

Computation and Language · Computer Science 2024-04-09 Bowen Qin , Duanyu Feng , Xi Yang

Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine…

Computation and Language · Computer Science 2024-06-14 Taiming Lu , Lingfeng Shen , Xinyu Yang , Weiting Tan , Beidi Chen , Huaxiu Yao
‹ Prev 1 2 3 10 Next ›