Related papers: Unifying Stable Optimization and Reference Regular…

200 papers

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun…

Machine Learning · Computer Science 2026-05-11 Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher…

Machine Learning · Computer Science 2024-01-02 Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, Huaimin Wang

The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is mainly dependent on the design of the underlying reward function, which is highly prone to reward hacking. A misalignment between the reward…

Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models…

Computation and Language · Computer Science 2024-10-24 Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further…

Machine Learning · Computer Science 2025-03-11 Shivanshu Shekhar, Tong Zhang

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the…

Machine Learning · Computer Science 2026-02-23 Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama

While large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. Here, we focus on two prevalent methods used to align these…

Computation and Language · Computer Science 2023-10-26 Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption,…

Machine Learning · Computer Science 2026-03-24 Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to…

Machine Learning · Computer Science 2025-03-14 Cassidy Laidlaw, Shivam Singhal, Anca Dragan

Collecting high-quality preference datasets for reinforcement learning from human feedback (RLHF) is resource-intensive and challenging. As a result, researchers often train reward models on extensive offline datasets which aggregate…

Machine Learning · Computer Science 2024-12-17 Shambhavi Krishna, Aishwarya Sahoo

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a…

Machine Learning · Computer Science 2024-12-05 Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI).…

Machine Learning · Computer Science 2024-10-17 Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained…

Machine Learning · Computer Science 2024-11-06 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on fixed…

Machine Learning · Computer Science 2025-03-11 Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model,…

Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning large language models (LLMs) with human preferences during post-training. This framework typically involves two stages: first, training a…

Machine Learning · Computer Science 2025-04-08 Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, Yonghui Wu

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from…

Machine Learning · Computer Science 2025-10-07 Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee