Related papers: Learning Kernel-Based MDPs from Episodic Preferent…

Models of human preference for learning reward functions

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between…

Machine Learning · Computer Science 2023-09-08 W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro Allievi

Making RL with Preference-based Feedback Efficient via Randomization

Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback…

Machine Learning · Computer Science 2024-03-14 Runzhe Wu, Wen Sun

Reinforcement Learning from Multi-level and Episodic Human Feedback

Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms…

Machine Learning · Computer Science 2025-04-29 Muhammad Qasim Elahi, Somtochukwu Oguchienti, Maheed H. Ahmed, Mahsa Ghasemi

Regret Bounds for Reinforcement Learning from Multi-Source Imperfect Preferences

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth…

Machine Learning · Computer Science 2026-04-03 Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami

Learning Optimal Advantage from Preferences and Mistaking it for Reward

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based…

Machine Learning · Computer Science 2023-10-05 W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum

Fine-Tuning Language Models with Reward Learning on Policy

Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy…

Computation and Language · Computer Science 2024-03-29 Hao Lang, Fei Huang, Yongbin Li

Provable Multi-Party Reinforcement Learning with Diverse Human Feedback

Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each…

Machine Learning · Computer Science 2024-03-11 Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang

The History and Risks of Reinforcement Learning and Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of…

Computers and Society · Computer Science 2023-11-29 Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

Reinforcement Learning from Human Feedback: A Statistical Perspective

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model

We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry…

Machine Learning · Computer Science 2024-11-13 Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang

From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding…

Machine Learning · Computer Science 2025-03-19 Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

Policy-labeled Preference Learning: Is Preference Enough for RLHF?

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

Differentially Private Reward Estimation with Preference Feedback

Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using…

Machine Learning · Computer Science 2023-10-31 Sayak Ray Chowdhury, Xingyu Zhou, Nagarajan Natarajan

Best Policy Learning from Trajectory Preference Feedback

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based…

Machine Learning · Computer Science 2026-04-23 Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Contrastive Preference Learning: Learning from Human Feedback without RL

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second,…

Machine Learning · Computer Science 2024-05-01 Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Influencing Humans to Conform to Preference Models for RLHF

Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly…

Machine Learning · Computer Science 2026-04-14 Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

Robust Reinforcement Learning from Corrupted Human Feedback

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, lack of training, etc., human annotators may…

Machine Learning · Computer Science 2024-07-10 Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao

LRHP: Learning Representations for Human Preferences via Preference Pairs

To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human…

Computation and Language · Computer Science 2024-10-08 Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu

Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback

Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs)…

Machine Learning · Statistics 2025-09-29 Gen Li, Yuling Yan