Related papers: SLIME: Stabilized Likelihood Implicit Margin Enfor…

Offline Preference Optimization via Maximum Marginal Likelihood Estimation

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that…

Machine Learning · Computer Science 2026-01-27 Saeed Najafi, Alona Fyshe

LIRE: listwise reward enhancement for preference alignment

Recently, tremendous strides have been made to align the generation of Large Language Models (LLMs) with human values to mitigate toxic or unhelpful content. Leveraging Reinforcement Learning from Human Feedback (RLHF) proves effective and…

Computation and Language · Computer Science 2024-06-05 Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

Larger or Smaller Reward Margins to Select Preferences for Alignment?

Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on…

Machine Learning · Computer Science 2025-03-05 Kexin Huang, Junkang Wu, Ziqian Chen, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of…

Artificial Intelligence · Computer Science 2026-01-13 Tarun Raheja, Nilay Pochhi

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting…

Machine Learning · Computer Science 2026-05-21 Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning from Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization

While astonishingly capable, Large Language Models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information.…

Artificial Intelligence · Computer Science 2024-10-30 Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

Beyond RLHF: A Unified Theoretical Framework of Alignment

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference…

Machine Learning · Computer Science 2024-06-25 Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the…

Computation and Language · Computer Science 2025-05-28 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, Ivan Bulyko

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science 2025-05-13 Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning…

Machine Learning · Statistics 2025-08-26 Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a…

Machine Learning · Computer Science 2024-12-05 Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang

Leveraging Robust Optimization for LLM Alignment under Distribution Shifts

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and…

Computation and Language · Computer Science 2025-10-21 Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods…

Machine Learning · Computer Science 2024-06-07 Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie

Statistical Rejection Sampling Improves Preference Optimization

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal…

Computation and Language · Computer Science 2024-01-25 Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed…

Machine Learning · Computer Science 2024-11-06 Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the…

Machine Learning · Computer Science 2025-04-23 Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi

Policy-labeled Preference Learning: Is Preference Enough for RLHF?

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling

Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue…

Computation and Language · Computer Science 2025-06-17 Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang