Related papers: Conformal Feedback Alignment: Quantifying Answer-L…

CLHA: A Simple yet Effective Contrastive Learning Framework for Human Alignment

Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences, ensuring these LLMs behave in beneficial and comprehensible ways to users. However, a longstanding…

Artificial Intelligence · Computer Science 2024-03-27 Feiteng Fang, Liang Zhu, Min Yang, Xi Feng, Jinchang Hou, Qixuan Zhao, Chengming Li, Xiping Hu, Ruifeng Xu

When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority…

Computation and Language · Computer Science 2025-10-28 Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

Beyond RLHF: A Unified Theoretical Framework of Alignment

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing…

Artificial Intelligence · Computer Science 2026-05-27 Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

Alignment is Localized: A Causal Probe into Preference Layers

Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral…

Machine Learning · Computer Science 2025-10-21 Archie Chaudhury

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice,…

Computation and Language · Computer Science 2024-12-24 Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju

Aligning Audio Captions with Human Preferences

Current audio captioning relies on supervised learning with paired audio-caption data, which is costly to curate and may not reflect human preferences in real-world scenarios. To address this, we propose a preference-aligned audio…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-26 Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert, Roberto Calandra

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

Aligning to What? Limits to RLHF Based Alignment

Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates…

Computation and Language · Computer Science 2025-03-13 Logan Barnhart, Reza Akbarian Bafghi, Stephen Becker, Maziar Raissi

SEAL: Systematic Error Analysis for Value ALignment

Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the…

Machine Learning · Computer Science 2024-08-21 Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

Towards Reliable Alignment: Uncertainty-aware RLHF

Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models…

Artificial Intelligence · Computer Science 2024-11-01 Debangshu Banerjee, Aditya Gopalan

Aligning Large Language Models with Human Preferences through Representation Engineering

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often…

Computation and Language · Computer Science 2024-07-04 Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability

Reinforcement Learning from Human Feedback (RLHF) is central in aligning large language models (LLMs) with human values and expectations. However, the process remains susceptible to governance challenges, including evaluator bias,…

Computers and Society · Computer Science 2025-04-22 Dana Alsagheer, Abdulrahman Kamal, Mohammad Kamal, Weidong Shi

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

The success of AI assistants based on Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by…

Computation and Language · Computer Science 2024-07-03 Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin

Implicit Regularization in Feedback Alignment Learning Mechanisms for Neural Networks

Feedback Alignment (FA) methods are biologically inspired local learning rules for training neural networks with reduced communication between layers. While FA has potential applications in distributed and privacy-aware ML, limitations in…

Machine Learning · Computer Science 2024-06-05 Zachary Robertson, Oluwasanmi Koyejo

Contextual Online Uncertainty-Aware Preference Learning for Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence to align large models with human preferences. In this paper, we propose a novel statistical framework to simultaneously conduct the…

Machine Learning · Statistics 2026-05-01 Nan Lu, Ethan Lee, Ethan X. Fang, Junwei Lu

Provable Multi-Party Reinforcement Learning with Diverse Human Feedback

Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each…

Machine Learning · Computer Science 2024-03-11 Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong