Related papers: Adaptive Preference Aggregation

Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework

Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups and…

Artificial Intelligence · Computer Science 2026-03-03 Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values, with remarkable successes in fine-tuning large-language models recently. Most existing RLHF paradigms make the…

Artificial Intelligence · Computer Science 2024-05-28 Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang, Asuman Ozdaglar

Axioms for AI Alignment from Human Feedback

In the context of reinforcement learning from human feedback (RLHF), the reward function is generally derived from maximum likelihood estimation of a random utility model based on pairwise comparisons made by humans. The problem of learning…

Computer Science and Game Theory · Computer Science 2024-11-08 Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu

Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and…

Machine Learning · Statistics 2025-06-17 Jiancong Xiao, Zhekun Shi, Kaizhao Liu, Qi Long, Weijie J. Su

AI Alignment and Social Choice: Fundamental Limitations and Policy Implications

Aligning AI agents to human intentions and values is a key bottleneck in building safe and deployable AI applications. But whose values should AI agents be aligned with? Reinforcement learning with human feedback (RLHF) has emerged as the…

Artificial Intelligence · Computer Science 2023-10-25 Abhilash Mishra

A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We…

Computation and Language · Computer Science 2025-12-17 Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

Jackpot! Alignment as a Maximal Lottery

Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the…

Artificial Intelligence · Computer Science 2025-02-03 Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson

Policy Aggregation

We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the…

Artificial Intelligence · Computer Science 2024-11-07 Parand A. Alamdari, Soroush Ebadian, Ariel D. Procaccia

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants,…

Machine Learning · Computer Science 2023-12-27 Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang

MaxMin-RLHF: Alignment with Diverse Human Preferences

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences…

Computation and Language · Computer Science 2024-12-30 Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Clone-Robust AI Alignment

A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has…

Machine Learning · Computer Science 2025-01-17 Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang

From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding…

Machine Learning · Computer Science 2025-03-19 Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor

APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

Aligning large language models (LLMs) with diverse human preferences requires pluralistic alignment, where a single model must respect the values of multiple distinct groups simultaneously. In federated reinforcement learning from human…

Machine Learning · Computer Science 2026-04-07 Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

Maximizing the efficiency of human feedback in AI alignment: a comparative analysis

Reinforcement Learning from Human Feedback (RLHF) relies on preference modeling to align machine learning systems with human values, yet the popular approach of random pair sampling with Bradley-Terry modeling is statistically limited and…

Human-Computer Interaction · Computer Science 2025-12-02 Andreas Chouliaras, Dimitris Chatzopoulos

Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly language models (LMs), with human preferences. There are typically multiple objectives driving the…

Machine Learning · Computer Science 2025-02-25 Nuoya Xiong, Aarti Singh

Preference Ranking Optimization for Human Alignment

Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment.…

Computation and Language · Computer Science 2024-02-28 Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang

Contextual Online Uncertainty-Aware Preference Learning for Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence to align large models with human preferences. In this paper, we propose a novel statistical framework to simultaneously conduct the…

Machine Learning · Statistics 2026-05-01 Nan Lu, Ethan Lee, Ethan X. Fang, Junwei Lu

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

The success of AI assistants based on Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by…

Computation and Language · Computer Science 2024-07-03 Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin