Related papers: Parameter Efficient Reinforcement Learning from Hu…

200 papers

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from…

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of…

Machine Learning · Computer Science · 2025-12-30 · Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data,…

Machine Learning · Computer Science · 2024-10-23 · Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences,…

Computation and Language · Computer Science · 2025-02-18 · Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through…

Computation and Language · Computer Science · 2023-10-10 · Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it…

Machine Learning · Statistics · 2026-04-06 · Pangpang Liu, Chengchun Shi, Will Wei Sun

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these…

Computation and Language · Computer Science · 2023-10-26 · Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics · 2026-02-11 · Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to significantly enhance the performance of large language models (LLMs) by aligning their outputs with desired human values through instruction tuning. However, RLHF…

Computation and Language · Computer Science · 2024-03-06 · Zhang Ze Yu, Lau Jia Jaw, Zhang Hui, Bryan Kian Hsiang Low

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge…

Computation and Language · Computer Science · 2025-06-04 · Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement…

Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual…

Machine Learning · Computer Science · 2025-03-11 · Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption,…

Machine Learning · Computer Science · 2026-03-24 · Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of…

Computers and Society · Computer Science · 2023-11-29 · Nathan Lambert, Thomas Krendl Gilbert, Tom Zick

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations,…

Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H)…

Machine Learning · Computer Science · 2023-10-11 · Hao Sun

In the field of autonomous driving, developing safe and trustworthy autonomous driving policies remains a significant challenge. Recently, Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its…

Robotics · Computer Science · 2024-09-06 · Zilin Huang, Zihao Sheng, Sikai Chen

Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To…

Machine Learning · Computer Science · 2025-10-28 · Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science · 2023-10-20 · Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science · 2024-08-20 · Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques