Related papers: DeAL: Decoding-time Alignment for Large Language M…

Decoding-time Realignment of Language Models

Aligning language models with human preferences is crucial for reducing errors and biases in these models. Alignment techniques, such as reinforcement learning from human feedback (RLHF), are typically cast as optimizing a tradeoff between…

Machine Learning · Computer Science 2024-05-27 Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel

ALaRM: Align Language Models via Hierarchical Rewards Modeling

We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework…

Computation and Language · Computer Science 2024-03-19 Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, Zhongyu Wei

Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)-based fine-tuning is a key mechanism for improvement, but its effectiveness is…

Machine Learning · Computer Science 2026-02-11 Pei-Chi Pan, Yingbin Liang, Sen Lin

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement…

Machine Learning · Computer Science 2024-04-17 Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva

A Technical Survey of Reinforcement Learning Techniques for Large Language Models

Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs), addressing critical challenges in instruction following, ethical alignment, and reasoning capabilities. This…

Artificial Intelligence · Computer Science 2025-07-08 Saksham Sahai Srivastava, Vaneet Aggarwal

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference…

Machine Learning · Computer Science 2024-06-25 Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

PAD: Personalized Alignment of LLMs at Decoding-Time

Aligning with personalized preferences, which vary significantly across cultural, educational, and political differences, poses a significant challenge due to the computational costs and data demands of traditional alignment methods. In…

Computation and Language · Computer Science 2025-03-14 Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is…

Machine Learning · Computer Science 2025-05-30 Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang

A Survey on Training-free Alignment of Large Language Models

The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from…

Computation and Language · Computer Science 2025-09-11 Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

ARGS: Alignment as Reward-Guided Search

Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided…

Computation and Language · Computer Science 2024-02-06 Maxim Khanov, Jirayu Burapacheep, Yixuan Li

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential.…

Computation and Language · Computer Science 2024-10-21 Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu

Decoding Alignment: A Critical Survey of LLM Development Initiatives through Value-setting and Data-centric Lens

AI Alignment, primarily in the form of Reinforcement Learning from Human Feedback (RLHF), has been a cornerstone of the post-training phase in developing Large Language Models (LLMs). It has also been a popular research topic across various…

Computation and Language · Computer Science 2025-08-26 Ilias Chalkidis

Dense Reward for Free in Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating…

Machine Learning · Computer Science 2024-02-02 Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds as collecting human preference data, training a reward model on said…

Machine Learning · Computer Science 2024-02-05 Nathan Lambert, Roberto Calandra

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

Alignment of Large Language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader…

Computation and Language · Computer Science 2025-03-28 Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, Sumitra Ganesh

Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and code generation. However, applying RL in broader domains like…

Computation and Language · Computer Science 2025-02-10 Hao Sun, Yunyi Shen, Jean-Francois Ton, Mihaela van der Schaar

Decoding-Time Language Model Alignment with Multiple Objectives

Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their…

Machine Learning · Computer Science 2024-10-29 Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon S. Du

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or…

Computation and Language · Computer Science 2026-04-03 Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang

Large Language Models and Algorithm Execution: Application to an Arithmetic Function

Large Language Models (LLMs) have recently developed new advanced functionalities. Their effectiveness relies on statistical learning and generalization capabilities. However, they face limitations in internalizing the data they process and…

Machine Learning · Computer Science 2026-01-14 Farah Ben Slama, Frédéric Armetta

A Survey on Progress in LLM Alignment from the Perspective of Reward Design

Reward design plays a pivotal role in aligning large language models (LLMs) with human values, serving as the bridge between feedback signals and model optimization. This survey provides a structured organization of reward modeling and…

Computation and Language · Computer Science 2025-09-03 Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem