Related papers: Self-supervised Attribute-aware Dynamic Preference…

Data-Centric Human Preference with Rationales for Direct Preference Alignment

Aligning language models with human preferences through reinforcement learning from human feedback is crucial for their safe and effective deployment. The human preference is typically represented through comparison where one response is…

Machine Learning · Computer Science 2025-07-15 Hoang Anh Just, Ming Jin, Anit Sahu, Huy Phan, Ruoxi Jia

Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment

Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we…

Machine Learning · Computer Science 2025-03-05 Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim

Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Machine Learning · Computer Science 2025-10-21 Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Artificial Intelligence · Computer Science 2025-10-20 Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis

Reinforcement Learning from Human Feedback with Active Queries

Aligning large language models (LLM) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF…

Machine Learning · Computer Science 2025-02-12 Kaixuan Ji, Jiafan He, Quanquan Gu

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

Low-Rank Contextual Reinforcement Learning from Heterogeneous Human Feedback

Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses…

Machine Learning · Statistics 2026-03-05 Seong Jin Lee, Will Wei Sun, Yufeng Liu

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority…

Computation and Language · Computer Science 2025-10-28 Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

Preference Ranking Optimization for Human Alignment

Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment.…

Computation and Language · Computer Science 2024-02-28 Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang

Alignment is Localized: A Causal Probe into Preference Layers

Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral…

Machine Learning · Computer Science 2025-10-21 Archie Chaudhury

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning

Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates…

Machine Learning · Computer Science 2025-02-04 Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury

Alignment Data Map for Efficient Preference Data Selection and Diagnosis

Human preference data is essential for aligning large language models (LLMs) with human values, but collecting such data is often costly and inefficient, motivating the need for efficient data selection methods that reduce annotation costs…

Computation and Language · Computer Science 2026-04-21 Seohyeong Lee, Eunwon Kim, Hwaran Lee, Buru Chang

AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a…

Artificial Intelligence · Computer Science 2024-02-06 Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Tangjie Lv, Changjie Fan, Zhipeng Hu

Aligning Deep Implicit Preferences by Learning to Reason Defensively

Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences (including…

Artificial Intelligence · Computer Science 2026-04-29 Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen

Learning from Preferences and Mixed Demonstrations in General Settings

Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be…

Machine Learning · Computer Science 2025-08-20 Jason R Brown, Carl Henrik Ek, Robert D Mullins

Weak Human Preference Supervision For Deep Reinforcement Learning

The current reward learning from human preferences could be used to resolve complex reinforcement learning (RL) tasks without access to a reward function by defining a single fixed preference between pairs of trajectory segments. However,…

Artificial Intelligence · Computer Science 2020-12-29 Zehong Cao, KaiChiu Wong, Chin-Teng Lin

The Limits of Preference Data for Post-Training

Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for…

Machine Learning · Computer Science 2025-05-27 Eric Zhao, Jessica Dai, Pranjal Awasthi

From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding…

Machine Learning · Computer Science 2025-03-19 Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor