
Related papers: Conformal Feedback Alignment: Quantifying Answer-L…

200 papers

Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences, ensuring these LLMs behave in beneficial and comprehensible ways to users. However, a longstanding…

Artificial Intelligence · Computer Science 2024-03-27 Feiteng Fang, Liang Zhu, Min Yang, Xi Feng, Jinchang Hou, Qixuan Zhao, Chengming Li, Xiping Hu, Ruifeng Xu

While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority…

Computation and Language · Computer Science 2025-10-28 Yijiang River Dong, Tiancheng Hu, Yinhong Liu, Ahmet Üstün, Nigel Collier

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science 2026-05-19 Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing…

Artificial Intelligence · Computer Science 2026-05-27 Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

Reinforcement Learning frameworks, particularly those utilizing human annotations, have become an increasingly popular method for preference fine-tuning, where the outputs of a language model are tuned to match a certain set of behavioral…

Machine Learning · Computer Science 2025-10-21 Archie Chaudhury

The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice,…

Computation and Language · Computer Science 2024-12-24 Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju

Current audio captioning relies on supervised learning with paired audio-caption data, which is costly to curate and may not reflect human preferences in real-world scenarios. To address this, we propose a preference-aligned audio…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-26 Kartik Hegde, Rehana Mahfuz, Yinyi Guo, Erik Visser

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert, Roberto Calandra

Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual…

Machine Learning · Computer Science 2024-08-20 Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates…

Computation and Language · Computer Science 2025-03-13 Logan Barnhart, Reza Akbarian Bafghi, Stephen Becker, Maziar Raissi

Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the…

Machine Learning · Computer Science 2024-08-21 Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings…

Machine Learning · Computer Science 2024-06-06 Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models…

Artificial Intelligence · Computer Science 2024-11-01 Debangshu Banerjee, Aditya Gopalan

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often…

Computation and Language · Computer Science 2024-07-04 Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Reinforcement Learning from Human Feedback (RLHF) is central in aligning large language models (LLMs) with human values and expectations. However, the process remains susceptible to governance challenges, including evaluator bias,…

Computers and Society · Computer Science 2025-04-22 Dana Alsagheer, Abdulrahman Kamal, Mohammad Kamal, Weidong Shi

The success of AI assistants based on Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by…

Computation and Language · Computer Science 2024-07-03 Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin

Feedback Alignment (FA) methods are biologically inspired local learning rules for training neural networks with reduced communication between layers. While FA has potential applications in distributed and privacy-aware ML, limitations in…

Machine Learning · Computer Science 2024-06-05 Zachary Robertson, Oluwasanmi Koyejo

Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence to align large models with human preferences. In this paper, we propose a novel statistical framework to simultaneously conduct the…

Machine Learning · Statistics 2026-05-01 Nan Lu, Ethan Lee, Ethan X. Fang, Junwei Lu

Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each…

Machine Learning · Computer Science 2024-03-11 Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong