Related papers: Improving Context-Aware Preference Modeling for La…

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science · 2025-05-13 · Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences

Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current…

Computation and Language · Computer Science · 2026-04-03 · Simona-Vasilica Oprea, Adela Bâra

ICPL: Few-shot In-context Preference Learning via LLMs

Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify but can be exceedingly inefficient as preference learning is often tabula rasa. We demonstrate that Large Language Models (LLMs)…

Artificial Intelligence · Computer Science · 2025-04-04 · Chao Yu, Qixin Tan, Hong Lu, Jiaxuan Gao, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is…

Computation and Language · Computer Science · 2024-11-05 · Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu

Data-Centric Human Preference with Rationales for Direct Preference Alignment

Aligning language models with human preferences through reinforcement learning from human feedback is crucial for their safe and effective deployment. The human preference is typically represented through comparison where one response is…

Machine Learning · Computer Science · 2025-07-15 · Hoang Anh Just, Ming Jin, Anit Sahu, Huy Phan, Ruoxi Jia

Capturing Individual Human Preferences with Reward Features

Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for…

Artificial Intelligence · Computer Science · 2026-02-20 · André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle

Establishing Knowledge Preference in Language Models

Language models are known to encode a great amount of factual knowledge through pretraining. However, such knowledge might be insufficient to cater to user requests, requiring the model to integrate external knowledge sources and adhere to…

Computation and Language · Computer Science · 2024-07-19 · Sizhe Zhou, Sha Li, Yu Meng, Yizhu Jiao, Heng Ji, Jiawei Han

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

Humans use social context to specify preferences over behaviors, i.e. their reward functions. Yet, algorithms for inferring reward models from preference data do not take this social learning view into account. Inspired by pragmatic human…

Machine Learning · Computer Science · 2024-05-24 · Andi Peng, Yuying Sun, Tianmin Shu, David Abel

Preference Learning for AI Alignment: a Causal Perspective

Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal…

Artificial Intelligence · Computer Science · 2026-05-12 · Katarzyna Kobalczyk, Mihaela van der Schaar

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as…

Artificial Intelligence · Computer Science · 2024-01-15 · Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

Training Language Models with Language Feedback

Pretrained language models often do not perform tasks in ways that are in line with our preferences, e.g., generating offensive text or factually incorrect summaries. Recent work approaches the above issue by learning from a simple form of…

Computation and Language · Computer Science · 2022-11-18 · Jérémy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez

Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward…

Computation and Language · Computer Science · 2025-05-09 · Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao

Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries

Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response to be useful -- is not explicit. For…

Computation and Language · Computer Science · 2025-05-27 · Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, Kyle Lo

Everyone Deserves A Reward: Learning Customized Human Preferences

Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences with respect to…

Computation and Language · Computer Science · 2023-09-18 · Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du

Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly,…

Computation and Language · Computer Science · 2024-10-10 · Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

Cross-Preference Learning for Sentence-Level and Context-Aware Machine Translation

Context-aware machine translation (MT) leverages document-level information, yet it does not consistently outperform sentence-level MT, as contextual signals are unevenly beneficial across sentences. Existing training objectives do not…

Computation and Language · Computer Science · 2026-03-27 · Ying Li, Xinglin Lyu, Junhui Li, Jinlong Yang, Hengchao Shang, Min Zhang, Shimin Tao, Daimeng Wei

Reward Model Interpretability via Optimal and Pessimal Tokens

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models…

Computation and Language · Computer Science · 2026-02-04 · Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

Learning Contextually-Adaptive Rewards via Calibrated Features

A key challenge in reward learning from human input is that desired agent behavior often changes based on context. For example, a robot must adapt to avoid a stove once it becomes hot. We observe that while high-level preferences (e.g.,…

Robotics · Computer Science · 2026-01-14 · Alexandra Forsey-Smerek, Julie Shah, Andreea Bobu

Preference-grounded Token-level Guidance for Language Model Fine-tuning

Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the sequence level while LM training and generation both occur at the…

Computation and Language · Computer Science · 2025-01-09 · Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mingyuan Zhou

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is…

Computation and Language · Computer Science · 2025-06-06 · Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang