Related papers: Unifying Stable Optimization and Reference Regular…

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun…

Machine Learning · Computer Science 2026-05-11 Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles

Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher…

Machine Learning · Computer Science 2024-01-02 Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang

REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback

The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is mainly dependent on the design of the underlying reward function, which is highly prone to reward hacking. A misalignment between the reward…

Robotics · Computer Science 2025-01-22 Souradip Chakraborty, Anukriti Singh, Amisha Bhaskar, Pratap Tokekar, Dinesh Manocha, Amrit Singh Bedi

Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models…

Computation and Language · Computer Science 2024-10-24 Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

ROCM: RLHF on consistency models

Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further…

Machine Learning · Computer Science 2025-03-11 Shivanshu Shekhar, Tong Zhang

Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the…

Machine Learning · Computer Science 2026-02-23 Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama

SuperHF: Supervised Iterative Learning from Human Feedback

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these…

Computation and Language · Computer Science 2023-10-26 Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

Beyond RLHF: A Unified Theoretical Framework of Alignment

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption,…

Machine Learning · Computer Science 2026-03-24 Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to…

Machine Learning · Computer Science 2025-03-14 Cassidy Laidlaw, Shivam Singhal, Anca Dragan

Solving the Inverse Alignment Problem for Efficient RLHF

Collecting high-quality preference datasets for reinforcement learning from human feedback (RLHF) is resource-intensive and challenging. As a result, researchers often train reward models on extensive offline datasets which aggregate…

Machine Learning · Computer Science 2024-12-17 Shambhavi Krishna, Aishwarya Sahoo

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a…

Machine Learning · Computer Science 2024-12-05 Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang

Reward-Robust RLHF in LLMs

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI).…

Machine Learning · Computer Science 2024-10-17 Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained…

Machine Learning · Computer Science 2024-11-06 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

Mitigating Preference Hacking in Policy Optimization with Pessimism

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on fixed…

Machine Learning · Computer Science 2025-03-11 Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal

Real-Time Aligned Reward Model beyond Semantics

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model,…

Artificial Intelligence · Computer Science 2026-05-19 Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization

Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning large language models (LLMs) with human preferences during post-training. This framework typically involves two stages: first, training a…

Machine Learning · Computer Science 2025-04-08 Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, Yonghui Wu

Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from…

Machine Learning · Computer Science 2025-10-07 Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu

Policy-labeled Preference Learning: Is Preference Enough for RLHF?

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee