Related papers: Making RL with Preference-based Feedback Efficient…

Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference

Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning…

Machine Learning · Computer Science 2025-11-07 Matteo Cercola, Valeria Capretti, Simone Formentin

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics 2026-02-11 Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback

Reinforcement learning with human feedback (RLHF), which learns a reward model from human preference data and then optimizes a policy to favor preferred responses, has emerged as a central paradigm for aligning large language models (LLMs)…

Machine Learning · Statistics 2025-09-29 Gen Li, Yuling Yan

Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms…

Machine Learning · Computer Science 2025-08-11 Han Qi, Haochen Yang, Qiaosheng Zhang, Zhuoran Yang

Multi-turn Reinforcement Learning from Preference Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work…

Machine Learning · Computer Science 2024-12-03 Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

Optimal Design for Reward Modeling in RLHF

Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text…

Machine Learning · Computer Science 2024-10-24 Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre Ménard, Eric Moulines, Michal Valko

Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds

Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from…

Machine Learning · Computer Science 2025-05-30 Aya Kayal, Sattar Vakili, Laura Toni, Da-shan Shiu, Alberto Bernacchia

Fine-Tuning Language Models with Reward Learning on Policy

Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy…

Computation and Language · Computer Science 2024-03-29 Hao Lang, Fei Huang, Yongbin Li

Regret Bounds for Reinforcement Learning from Multi-Source Imperfect Preferences

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth…

Machine Learning · Computer Science 2026-04-03 Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami

Teaching Large Language Models to Reason with Reinforcement Learning

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from…

Machine Learning · Computer Science 2024-03-08 Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

Reinforcement Learning from Human Feedback: A Statistical Perspective

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

Contrastive Preference Learning: Learning from Human Feedback without RL

Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for aligning models with human intent. Typically RLHF algorithms operate in two phases: first, use human preferences to learn a reward function and second,…

Machine Learning · Computer Science 2024-05-01 Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

The History and Risks of Reinforcement Learning and Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of…

Computers and Society · Computer Science 2023-11-29 Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design

We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative…

Machine Learning · Computer Science 2025-12-05 Andreas Schlaginhaufen, Reda Ouhamma, Maryam Kamgarpour

Models of human preference for learning reward functions

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between…

Machine Learning · Computer Science 2023-09-08 W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, Alessandro Allievi

Learning Kernel-Based MDPs from Episodic Preferential Feedback

Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous…

Machine Learning · Statistics 2026-05-26 Nikola Pavlovic, Sattar Vakili, Qing Zhao

Is RLHF More Difficult than Standard RL?

Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes…

Machine Learning · Computer Science 2023-11-07 Yuanhao Wang, Qinghua Liu, Chi Jin

Towards Understanding the Influence of Reward Margin on Preference Model Performance

Reinforcement Learning from Human Feedback (RLHF) is a widely used framework for the training of language models. However, the process of using RLHF to develop a language model that is well-aligned presents challenges, especially when it…

Computation and Language · Computer Science 2024-04-09 Bowen Qin, Duanyu Feng, Xi Yang

Best Policy Learning from Trajectory Preference Feedback

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based…

Machine Learning · Computer Science 2026-04-23 Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen