Related papers: REFA: Reference Free Alignment for multi-preferenc…

RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment

Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that…

Artificial Intelligence · Computer Science 2026-03-02 Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

Decoding-time Realignment of Language Models

Aligning language models with human preferences is crucial for reducing errors and biases in these models. Alignment techniques, such as reinforcement learning from human feedback (RLHF), are typically cast as optimizing a tradeoff between…

Machine Learning · Computer Science 2024-05-27 Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel

RePO: Understanding Preference Learning Through ReLU-Based Optimization

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with single…

Machine Learning · Computer Science 2025-10-28 Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

T-REG: Preference Optimization with Token-Level Reward Regularization

Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the…

Computation and Language · Computer Science 2024-12-04 Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng

Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based…

Machine Learning · Computer Science 2025-06-19 Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu

SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the…

Machine Learning · Computer Science 2024-10-15 Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan

RLAF: Reinforcement Learning from Automaton Feedback

Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide…

Machine Learning · Computer Science 2025-10-20 Mahyar Alinejad, Alvaro Velasquez, Yue Wang, George Atia

Reverse Preference Optimization for Complex Instruction Following

Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the…

Computation and Language · Computer Science 2025-05-29 Xiang Huang, Ting-En Lin, Feiteng Fang, Yuchuan Wu, Hangyu Li, Yuzhong Qu, Fei Huang, Yongbin Li

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Preference-based alignment like Reinforcement Learning from Human Feedback (RLHF) learns from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more…

Machine Learning · Computer Science 2026-01-27 Tiejin Chen, Xiaoou Liu, Vishnu Nandam, Kuan-Ru Liou, Hua Wei

ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference

Aligning language models to human expectations, e.g., being helpful and harmless, has become a pressing challenge for large language models. A typical alignment procedure consists of supervised fine-tuning and preference learning. Most…

Machine Learning · Computer Science 2024-02-27 Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang

RSPO: Regularized Self-Play Alignment of Large Language Models

Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, the regularization with respect to the reference policy, which is…

Machine Learning · Computer Science 2025-07-09 Xiaohang Tang, Sangwoong Yoon, Seongho Son, Huizhuo Yuan, Quanquan Gu, Ilija Bogunovic

Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data

Reinforcement Learning from Human Feedback (RLHF) has proven effective in aligning large language models with human intentions, yet it often relies on complex methodologies like Proximal Policy Optimization (PPO) that require extensive…

Computation and Language · Computer Science 2024-08-30 Han Xia, Songyang Gao, Qiming Ge, Zhiheng Xi, Qi Zhang, Xuanjing Huang

Selective Preference Optimization via Token-Level Reward Function Estimation

Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be…

Computation and Language · Computer Science 2025-11-07 Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou

Unifying Stable Optimization and Reference Regularization in RLHF

Reinforcement Learning from Human Feedback (RLHF) has advanced alignment capabilities significantly but remains hindered by two core challenges: reward hacking and stable optimization. Current solutions independently…

Machine Learning · Computer Science 2026-02-13 Li He, Qiang Qu, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback

The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is mainly dependent on the design of the underlying reward function, which is highly prone to reward hacking. A misalignment between the reward…

Robotics · Computer Science 2025-01-22 Souradip Chakraborty, Anukriti Singh, Amisha Bhaskar, Pratap Tokekar, Dinesh Manocha, Amrit Singh Bedi

UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs) with human values. However, existing approaches struggle to capture the multi-dimensional, distributional nuances of human…

Computation and Language · Computer Science 2025-05-20 Zelei Cheng, Xin-Qiang Cai, Yuting Tang, Pushi Zhang, Boming Yang, Masashi Sugiyama, Xinyu Xing

Implicit Regularization in Feedback Alignment Learning Mechanisms for Neural Networks

Feedback Alignment (FA) methods are biologically inspired local learning rules for training neural networks with reduced communication between layers. While FA has potential applications in distributed and privacy-aware ML, limitations in…

Machine Learning · Computer Science 2024-06-05 Zachary Robertson, Oluwasanmi Koyejo

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA…

Computation and Language · Computer Science 2026-05-29 Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream…

Computation and Language · Computer Science 2026-04-03 Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia