Related papers: Parameter Efficient Reinforcement Learning from Hu…

Teaching Large Language Models to Reason with Reinforcement Learning

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from…

Machine Learning · Computer Science 2024-03-08 Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

A Survey of Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of…

Machine Learning · Computer Science 2025-12-30 Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data,…

Machine Learning · Computer Science 2024-10-23 Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences,…

Computation and Language · Computer Science 2025-02-18 Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through…

Computation and Language · Computer Science 2023-10-10 Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

Reinforcement Learning from Human Feedback: A Statistical Perspective

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

SuperHF: Supervised Iterative Learning from Human Feedback

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these…

Computation and Language · Computer Science 2023-10-26 Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics 2026-02-11 Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

Fine-tuning Language Models with Generative Adversarial Reward Modelling

Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to significantly enhance the performance of large language models (LLMs) by aligning their outputs with desired human values through instruction tuning. However, RLHF…

Computation and Language · Computer Science 2024-03-06 Zhang Ze Yu, Lau Jia Jaw, Zhang Hui, Bryan Kian Hsiang Low

Self-Evolved Reward Learning for LLMs

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge…

Computation and Language · Computer Science 2025-06-04 Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement…

Machine Learning · Computer Science 2024-04-17 Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva

Language Model Personalization via Reward Factorization

Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual…

Machine Learning · Computer Science 2025-03-11 Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano

RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption,…

Machine Learning · Computer Science 2026-03-24 Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

The History and Risks of Reinforcement Learning and Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of…

Computers and Society · Computer Science 2023-11-29 Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations,…

Computation and Language · Computer Science 2022-04-13 Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan

Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond

Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H)…

Machine Learning · Computer Science 2023-10-11 Hao Sun

Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving

In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its…

Robotics · Computer Science 2024-09-06 Zilin Huang, Zihao Sheng, Sikai Chen

Provably Efficient Online RLHF with One-Pass Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To…

Machine Learning · Computer Science 2025-10-28 Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

Safe RLHF: Safe Reinforcement Learning from Human Feedback

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science 2023-10-20 Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques