Related papers: SLIME: Stabilized Likelihood Implicit Margin Enfor…

200 papers

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that…

Machine Learning · Computer Science 2026-01-27 Saeed Najafi, Alona Fyshe

Recently, tremendous strides have been made to align the generation of Large Language Models (LLMs) with human values to mitigate toxic or unhelpful content. Leveraging Reinforcement Learning from Human Feedback (RLHF) proves effective and…

Computation and Language · Computer Science 2024-06-05 Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on…

Machine Learning · Computer Science 2025-03-05 Kexin Huang, Junkang Wu, Ziqian Chen, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of…

Artificial Intelligence · Computer Science 2026-01-13 Tarun Raheja, Nilay Pochhi

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting…

Machine Learning · Computer Science 2026-05-21 Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

While astonishingly capable, Large Language Models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information.…

Artificial Intelligence · Computer Science 2024-10-30 Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference…

Machine Learning · Computer Science 2024-06-25 Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the…

Computation and Language · Computer Science 2025-05-28 Guan-Ting Lin, Prashanth Gurunath Shivakumar, Aditya Gourav, Yile Gu, Ankur Gandhe, Hung-yi Lee, Ivan Bulyko

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science 2025-05-13 Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning…

Machine Learning · Statistics 2025-08-26 Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a…

Machine Learning · Computer Science 2024-12-05 Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and…

Computation and Language · Computer Science 2025-10-21 Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao

This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods…

Machine Learning · Computer Science 2024-06-07 Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie

Improving the alignment of language models with human preferences remains an active research challenge. Previous approaches have primarily utilized Reinforcement Learning from Human Feedback (RLHF) via online RL methods such as Proximal…

Computation and Language · Computer Science 2024-01-25 Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed…

Machine Learning · Computer Science 2024-11-06 Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the…

Machine Learning · Computer Science 2025-04-23 Hui Yuan, Yifan Zeng, Yue Wu, Huazheng Wang, Mengdi Wang, Liu Leqi

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning…

Machine Learning · Computer Science 2025-05-14 Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee

Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue…

Computation and Language · Computer Science 2025-06-17 Qiyuan Deng, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang