Related papers: RLHFless: Serverless Computing for Efficient RLHF

HybridFlow: A Flexible and Efficient RLHF Framework

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes…

Machine Learning · Computer Science 2024-10-03 Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu

Reinforcement Learning from Human Feedback: A Statistical Perspective

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics 2026-04-06 Pangpang Liu, Chengchun Shi, Will Wei Sun

RLHF Workflow: From Reward Modeling to Online RLHF

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model…

Machine Learning · Computer Science 2024-11-13 Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond

Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H)…

Machine Learning · Computer Science 2023-10-11 Hao Sun

Provably Efficient Online RLHF with One-Pass Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To…

Machine Learning · Computer Science 2025-10-28 Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement…

Machine Learning · Computer Science 2024-04-17 Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in post-training of LLMs, its scaling potential is still largely…

Computation and Language · Computer Science 2024-12-10 Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound…

Artificial Intelligence · Computer Science 2025-10-10 Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu

Parameter Efficient Reinforcement Learning from Human Feedback

While Reinforcement Learning from Human Feedback (RLHF) effectively aligns pretrained Large Language and Vision-Language Models (LLMs and VLMs) with human preferences, its computational cost and complexity hamper its wider adoption. To…

Machine Learning · Computer Science 2024-09-16 Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Simral Chaudhary, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon

A Survey of Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of…

Machine Learning · Computer Science 2025-12-30 Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences,…

Computation and Language · Computer Science 2025-02-18 Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for empowering large language model (LLM) applications. Compared with the supervised training process of LLMs, the RLHF training process is much more sophisticated,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-25 Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu

Teaching Large Language Models to Reason with Reinforcement Learning

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from…

Machine Learning · Computer Science 2024-03-08 Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations,…

Computation and Language · Computer Science 2022-04-13 Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert, Roberto Calandra

G-Core: A Simple, Scalable and Balanced RLHF Trainer

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often…

Machine Learning · Computer Science 2025-08-01 Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao

Optimizing RLHF Training for Large Language Models with Stage Fusion

We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles…

Machine Learning · Computer Science 2025-04-23 Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin

Safe RLHF: Safe Reinforcement Learning from Human Feedback

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science 2023-10-20 Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through…

Computation and Language · Computer Science 2023-10-10 Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

SuperHF: Supervised Iterative Learning from Human Feedback

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these…

Computation and Language · Computer Science 2023-10-26 Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti