Related papers: Self-Play Preference Optimization for Language Mod…

Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM)…

Computation and Language · Computer Science · 2025-04-22 · Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J. Su, Yaodong Yang

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption,…

Machine Learning · Computer Science · 2025-03-04 · Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

Multiplayer Nash Preference Optimization

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley-Terry assumption struggle to capture the…

Artificial Intelligence · Computer Science · 2026-04-08 · Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

RSPO: Regularized Self-Play Alignment of Large Language Models

Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is…

Machine Learning · Computer Science · 2025-07-09 · Xiaohang Tang, Sangwoong Yoon, Seongho Son, Huizhuo Yuan, Quanquan Gu, Ilija Bogunovic

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To…

Computation and Language · Computer Science · 2024-06-03 · Yueqin Yin, Zhendong Wang, Yujia Xie, Weizhu Chen, Mingyuan Zhou

Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This…

Machine Learning · Computer Science · 2025-07-10 · Runlong Zhou, Maryam Fazel, Simon S. Du

Proximal Point Nash Learning from Human Feedback

Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human…

Machine Learning · Statistics · 2026-03-24 · Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard

Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees

Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the…

Machine Learning · Computer Science · 2025-05-27 · Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally…

Machine Learning · Computer Science · 2026-05-07 · Jiaming Hu, Jiamu Bai, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis

SGPO: Self-Generated Preference Optimization based on Self-Improver

Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy…

Computation and Language · Computer Science · 2025-07-29 · Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim

Stackelberg Self-Annotation: A Robust Approach to Data-Efficient LLM Alignment

Aligning large language models (LLMs) with human preferences typically demands vast amounts of meticulously curated data, which is both expensive and prone to labeling noise. We propose Stackelberg Game Preference Optimization (SGPO), a…

Machine Learning · Computer Science · 2026-01-22 · Xu Chu, Zhixin Zhang, Tianyu Jia, Yujie Jin

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science · 2024-07-31 · Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Self-Improving Robust Preference Optimization

Online and offline RLHF methods, such as PPO and DPO, have been highly successful in aligning AI with human preferences. Despite their success, however, these methods suffer from fundamental limitations: (a) Models trained with RLHF can…

Machine Learning · Computer Science · 2025-04-15 · Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science · 2024-09-27 · Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from…

Machine Learning · Computer Science · 2024-04-08 · Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent

Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model assumption,…

Machine Learning · Computer Science · 2025-02-25 · Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, Dong Yu

SPRec: Self-Play to Debias LLM-based Recommendation

Large language models (LLMs) have attracted significant attention in recommendation systems. Current work primarily applies supervised fine-tuning (SFT) to adapt the model for recommendation tasks. However, SFT on positive examples only…

Information Retrieval · Computer Science · 2025-02-07 · Chongming Gao, Ruijun Chen, Shuai Yuan, Kexin Huang, Yuanqing Yu, Xiangnan He

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods…

Machine Learning · Computer Science · 2024-06-07 · Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie

Accelerated Preference Optimization for Large Language Model Alignment

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a…

Machine Learning · Computer Science · 2024-10-10 · Jiafan He, Huizhuo Yuan, Quanquan Gu

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are…

Computation and Language · Computer Science · 2024-12-23 · Shuo Xie, Fangzhi Zhu, Jiahui Wang, Lulu Wen, Wei Dai, Xiaowei Chen, Junxiong Zhu, Kai Zhou, Bo Zheng