Related papers: Learning Kernel-Based MDPs from Episodic Preferent…

200 papers

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between…

Machine Learning · Computer Science 2023-09-08 W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro Allievi

Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback…

Machine Learning · Computer Science 2024-03-14 Runzhe Wu, Wen Sun

Designing an effective reward function has long been a challenge in reinforcement learning, particularly for complex tasks in unstructured environments. To address this, various learning paradigms have emerged that leverage different forms…

Machine Learning · Computer Science 2025-04-29 Muhammad Qasim Elahi, Somtochukwu Oguchienti, Maheed H. Ahmed, Mahsa Ghasemi

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth…

Machine Learning · Computer Science 2026-04-03 Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based…

Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy…

Computation and Language · Computer Science 2024-03-29 Hao Lang, Fei Huang, Yongbin Li

Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each…

Machine Learning · Computer Science 2024-03-11 Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of…

Computers and Society · Computer Science 2023-11-29 Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry…

Machine Learning · Computer Science 2024-11-13 Chenlu Ye, Wei Xiong, Yuheng Zhang, Hanze Dong, Nan Jiang, Tong Zhang

One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding…

Machine Learning · Computer Science 2025-03-19 Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using…

Machine Learning · Computer Science 2023-10-31 Sayak Ray Chowdhury, Xingyu Zhou, Nagarajan Natarajan

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based…

Machine Learning · Computer Science 2026-04-23 Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second,…

Machine Learning · Computer Science 2024-05-01 Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly…

Machine Learning · Computer Science 2026-04-14 Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Peter Stone

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, lack of training, etc., human annotators may…

Machine Learning · Computer Science 2024-07-10 Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao

To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human…

Computation and Language · Computer Science 2024-10-08 Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu

Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs)…

Machine Learning · Statistics 2025-09-29 Gen Li, Yuling Yan