Related papers: Preference Learning for AI Alignment: a Causal Per…

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is…

Machine Learning · Computer Science 2025-05-30 Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science 2025-05-13 Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

Causality for Large Language Models

Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of…

Computation and Language · Computer Science 2024-10-22 Anpeng Wu, Kun Kuang, Minqin Zhu, Yingrong Wang, Yujia Zheng, Kairong Han, Baohong Li, Guangyi Chen, Fei Wu, Kun Zhang

A Survey on Human Preference Learning for Large Language Models

The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning, enhancing LLMs with excellent applicability and effectiveness in a…

Computation and Language · Computer Science 2024-06-19 Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, Min Zhang

Larger or Smaller Reward Margins to Select Preferences for Alignment?

Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on…

Machine Learning · Computer Science 2025-03-05 Kexin Huang, Junkang Wu, Ziqian Chen, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to…

Computation and Language · Computer Science 2024-11-01 Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Shanghaoran Quan, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang

Data-Centric Human Preference with Rationales for Direct Preference Alignment

Aligning language models with human preferences through reinforcement learning from human feedback is crucial for their safe and effective deployment. The human preference is typically represented through comparison where one response is…

Machine Learning · Computer Science 2025-07-15 Hoang Anh Just, Ming Jin, Anit Sahu, Huy Phan, Ruoxi Jia

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

Humans use social context to specify preferences over behaviors, i.e. their reward functions. Yet, algorithms for inferring reward models from preference data do not take this social learning view into account. Inspired by pragmatic human…

Machine Learning · Computer Science 2024-05-24 Andi Peng, Yuying Sun, Tianmin Shu, David Abel

Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained…

Computation and Language · Computer Science 2024-12-06 Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He

Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?

Recent advances in Large Language Models (LLMs) highlight the need to align their behaviors with human values. A critical, yet understudied, issue is the potential divergence between an LLM's stated preferences (its reported alignment with…

Artificial Intelligence · Computer Science 2025-06-03 Zhuojun Gu, Quan Wang, Shuchu Han

Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences

Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current…

Computation and Language · Computer Science 2026-04-03 Simona-Vasilica Oprea, Adela Bâra

Causal Inference with Large Language Model: A Survey

Causal inference has been a pivotal challenge across diverse domains such as medicine and economics, demanding a complicated integration of human knowledge, mathematical reasoning, and data mining capabilities. Recent advancements in…

Computation and Language · Computer Science 2025-02-11 Jing Ma

Can We Utilize Pre-trained Language Models within Causal Discovery Algorithms?

Scaling laws have allowed Pre-trained Language Models (PLMs) into the field of causal reasoning. Causal reasoning of PLM relies solely on text-based descriptions, in contrast to causal discovery which aims to determine the causal…

Artificial Intelligence · Computer Science 2023-11-21 Chanhui Lee, Juhyeon Kim, Yongjun Jeong, Juhyun Lyu, Junghee Kim, Sangmin Lee, Sangjun Han, Hyeokjun Choe, Soyeon Park, Woohyung Lim, Sungbin Lim, Sanghack Lee

PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences

Large foundation models pretrained on raw web-scale data are not readily deployable without an additional step of extensive alignment to human preferences. Such alignment is typically done by collecting large amounts of pairwise comparisons…

Machine Learning · Computer Science 2024-06-13 Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response…

Computation and Language · Computer Science 2026-04-09 Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao

Towards Reliable, Uncertainty-Aware Alignment

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model…

Machine Learning · Computer Science 2025-07-23 Debangshu Banerjee, Kintan Saha, Aditya Gopalan

Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment

Causal learning is the cognitive process of developing the capability of making causal inferences based on available information, often guided by normative principles. This process is prone to errors and biases, such as the illusion of…

Artificial Intelligence · Computer Science 2025-10-17 María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Giovanni Franco Gabriel Marraffini, Mario Alejandro Leiva, Gerardo I. Simari, María Vanina Martinez

Understanding the Learning Dynamics of Alignment with Human Feedback

Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these…

Machine Learning · Computer Science 2024-08-08 Shawn Im, Yixuan Li

Aligning LLMs with Domain Invariant Reward Models

Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from…

Machine Learning · Computer Science 2025-01-03 David Wu, Sanjiban Choudhury

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that the predominant approach for aligning…

Machine Learning · Statistics 2025-08-26 Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su