Related papers: Clone-Robust AI Alignment

Reward-Robust RLHF in LLMs

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI).…

Machine Learning · Computer Science · 2024-10-17 · Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen

RLTHF: Targeted Human Feedback for LLM Alignment

Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI…

Computation and Language · Computer Science · 2025-08-08 · Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra

CLHA: A Simple yet Effective Contrastive Learning Framework for Human Alignment

Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences, ensuring these LLMs behave in beneficial and comprehensible ways to users. However, a longstanding…

Artificial Intelligence · Computer Science · 2024-03-27 · Feiteng Fang, Liang Zhu, Min Yang, Xi Feng, Jinchang Hou, Qixuan Zhao, Chengming Li, Xiping Hu, Ruifeng Xu

Doubly Robust Alignment for Large Language Models

This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the…

Machine Learning · Computer Science · 2025-10-30 · Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing…

Artificial Intelligence · Computer Science · 2026-05-27 · Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

Jackpot! Alignment as a Maximal Lottery

Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the…

Artificial Intelligence · Computer Science · 2025-02-03 · Roberto-Rafael Maura-Rivero, Marc Lanctot, Francesco Visin, Kate Larson

RLHF: A comprehensive Survey for Cultural, Multimodal and Low Latency Alignment Methods

Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning Large Language Models (LLMs), yet recent progress has moved beyond canonical text-based methods. This survey synthesizes the new frontier of alignment research…

Machine Learning · Computer Science · 2025-11-07 · Raghav Sharma, Manan Mehta, Sai Tiger Raina

Dual Active Learning for Reinforcement Learning from Human Feedback

Aligning large language models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement learning from human feedback (RLHF) is widely applied to achieve this objective. A key step in…

Machine Learning · Statistics · 2025-01-03 · Pangpang Liu, Chengchun Shi, Will Wei Sun

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the…

Machine Learning · Statistics · 2026-02-11 · Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

Aligning Large Language Models with Human Preferences through Representation Engineering

Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often…

Computation and Language · Computer Science · 2024-07-04 · Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Beyond RLHF: A Unified Theoretical Framework of Alignment

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for…

Machine Learning · Computer Science · 2026-05-19 · Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Real-Time Aligned Reward Model beyond Semantics

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model,…

Artificial Intelligence · Computer Science · 2026-05-19 · Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption,…

Machine Learning · Computer Science · 2026-03-24 · Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao Xie, Xiang Wan, Anningzhe Gao

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data,…

Machine Learning · Computer Science · 2024-10-23 · Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Aligning Neural Machine Translation Models: Human Feedback in Training and Inference

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF's success in aligning and…

Computation and Language · Computer Science · 2024-07-08 · Miguel Moura Ramos, Patrick Fernandes, António Farinhas, André F. T. Martins

Safe RLHF: Safe Reinforcement Learning from Human Feedback

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness…

Artificial Intelligence · Computer Science · 2023-10-20 · Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved…

Computation and Language · Computer Science · 2025-02-17 · Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan

MaxMin-RLHF: Alignment with Diverse Human Preferences

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences…

Computation and Language · Computer Science · 2024-12-30 · Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

A Technical Survey of Reinforcement Learning Techniques for Large Language Models

Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs), addressing critical challenges in instruction following, ethical alignment, and reasoning capabilities. This…

Artificial Intelligence · Computer Science · 2025-07-08 · Saksham Sahai Srivastava, Vaneet Aggarwal

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science · 2024-12-03 · Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong