Related papers: Self-Play Preference Optimization for Language Mod…

200 papers

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM)…

Computation and Language · Computer Science 2025-04-22 Mingzhi Wang , Chengdong Ma , Qizhi Chen , Linjian Meng , Yang Han , Jiancong Xiao , Zhaowei Zhang , Jing Huo , Weijie J. Su , Yaodong Yang

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption,…

Machine Learning · Computer Science 2025-03-04 Yuheng Zhang , Dian Yu , Baolin Peng , Linfeng Song , Ye Tian , Mingyue Huo , Nan Jiang , Haitao Mi , Dong Yu

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley-Terry assumption struggle to capture the…

Artificial Intelligence · Computer Science 2026-04-08 Fang Wu , Xu Huang , Weihao Xuan , Zhiwei Zhang , Yijia Xiao , Guancheng Wan , Xiaomin Li , Bing Hu , Peng Xia , Jure Leskovec , Yejin Choi

Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is…

Machine Learning · Computer Science 2025-07-09 Xiaohang Tang , Sangwoong Yoon , Seongho Son , Huizhuo Yuan , Quanquan Gu , Ilija Bogunovic

Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To…

Computation and Language · Computer Science 2024-06-03 Yueqin Yin , Zhendong Wang , Yujia Xie , Weizhu Chen , Mingyuan Zhou

Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This…

Machine Learning · Computer Science 2025-07-10 Runlong Zhou , Maryam Fazel , Simon S. Du

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human…

Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the…

Machine Learning · Computer Science 2025-05-27 Yongtao Wu , Luca Viano , Yihang Chen , Zhenyu Zhu , Kimon Antonakopoulos , Quanquan Gu , Volkan Cevher

Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally…

Machine Learning · Computer Science 2026-05-07 Jiaming Hu , Jiamu Bai , Haoyu Wang , Debarghya Mukherjee , Ioannis Ch. Paschalidis

Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy…

Computation and Language · Computer Science 2025-07-29 Hyeonji Lee , Daejin Jo , Seohwan Yun , Sungwoong Kim

Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a…

Machine Learning · Computer Science 2026-01-22 Xu Chu , Zhixin Zhang , Tianyu Jia , Yujie Jin

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov , Archit Sharma , Eric Mitchell , Stefano Ermon , Christopher D. Manning , Chelsea Finn

Online and offline RLHF methods, such as PPO and DPO, have been highly successful in aligning AI with human preferences. Despite their success, however, these methods suffer from fundamental limitations: (a) Models trained with RLHF can…

Machine Learning · Computer Science 2025-04-15 Eugene Choi , Arash Ahmadian , Matthieu Geist , Olivier Pietquin , Mohammad Gheshlaghi Azar

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li , Haojing Huang , Yujia Zhang , Pengfei Xu , Xi Chen , Rui Song , Lida Shi , Jingwen Wang , Hao Xu

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from…

Machine Learning · Computer Science 2024-04-08 Corby Rosset , Ching-An Cheng , Arindam Mitra , Michael Santacroce , Ahmed Awadallah , Tengyang Xie

Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model assumption,…

Machine Learning · Computer Science 2025-02-25 Yuheng Zhang , Dian Yu , Tao Ge , Linfeng Song , Zhichen Zeng , Haitao Mi , Nan Jiang , Dong Yu

Large language models (LLMs) have attracted significant attention in recommendation systems. Current work primarily applies supervised fine-tuning (SFT) to adapt the model for recommendation tasks. However, SFT on positive examples only…

Information Retrieval · Computer Science 2025-02-07 Chongming Gao , Ruijun Chen , Shuai Yuan , Kexin Huang , Yuanqing Yu , Xiangnan He

This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods…

Machine Learning · Computer Science 2024-06-07 Xiang Ji , Sanjeev Kulkarni , Mengdi Wang , Tengyang Xie

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a…

Machine Learning · Computer Science 2024-10-10 Jiafan He , Huizhuo Yuan , Quanquan Gu

Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are…

Computation and Language · Computer Science 2024-12-23 Shuo Xie , Fangzhi Zhu , Jiahui Wang , Lulu Wen , Wei Dai , Xiaowei Chen , Junxiong Zhu , Kai Zhou , Bo Zheng