Related papers: SEAL: Systematic Error Analysis for Value ALignmen…

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing…

Artificial Intelligence · Computer Science 2026-05-27 Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert, Roberto Calandra

Reinforcement Learning from Human Feedback: A Statistical Perspective

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL

Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety.…

Machine Learning · Computer Science 2026-01-21 Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

Nash Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback,…

Machine Learning · Statistics 2024-06-12 Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference…

Machine Learning · Computer Science 2024-06-25 Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as…

Artificial Intelligence · Computer Science 2024-01-15 Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

A Survey of Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of…

Machine Learning · Computer Science 2025-12-30 Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Real-Time Aligned Reward Model beyond Semantics

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model,…

Artificial Intelligence · Computer Science 2026-05-19 Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Aligning the behavior of Large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and…

Computation and Language · Computer Science 2025-12-25 Jiayi Zhou, Jiaming Ji, Juntao Dai, Dong Li, Yaodong Yang

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Aligning human preference and value is an important requirement for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages: 1) supervised fine-tuning…

Artificial Intelligence · Computer Science 2024-10-29 Jiaxiang Li, Siliang Zeng, Hoi-To Wai, Chenliang Li, Alfredo Garcia, Mingyi Hong

Safe RLHF: Safe Reinforcement Learning from Human Feedback

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science 2023-10-20 Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

Aligning Large Language Models with Human Preferences through Representation Engineering

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often…

Computation and Language · Computer Science 2024-07-04 Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Self-Evolved Reward Learning for LLMs

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge…

Computation and Language · Computer Science 2025-06-04 Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics 2026-02-11 Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority…

Computation and Language · Computer Science 2025-10-28 Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Reinforcement Learning from Human Feedback (RLHF) is a widely used framework for the training of language models. However, the process of using RLHF to develop a language model that is well-aligned presents challenges, especially when it…

Computation and Language · Computer Science 2024-04-09 Bowen Qin, Duanyu Feng, Xi Yang

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine…

Computation and Language · Computer Science 2024-06-14 Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao