Related papers: Learning from Reference Answers: Versatile Languag…

Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained…

Computation and Language · Computer Science · 2024-12-06 · Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He

Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it…

Computation and Language · Computer Science · 2024-10-15 · Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami

OpenGenAlign: A Preference Dataset and Benchmark for Trustworthy Reward Modeling in Open-Ended, Long-Context Generation

Reward Modeling is critical in evaluating and improving the generation of Large Language Models (LLMs). While numerous recent works have shown its feasibility in improving safety, helpfulness, reasoning, and instruction-following ability,…

Computation and Language · Computer Science · 2025-11-13 · Hanning Zhang, Juntong Song, Juno Zhu, Yuanhao Wu, Tong Zhang, Cheng Niu

GRAM: A Generative Foundation Reward Model for Reward Generalization

In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward…

Computation and Language · Computer Science · 2026-01-27 · Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science · 2025-05-13 · Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

Despite the significant progress made by existing retrieval augmented language models (RALMs) in providing trustworthy responses and grounding in reliable sources, they often overlook effective alignment with human preferences. In the…

Computation and Language · Computer Science · 2024-12-19 · Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Energy-Based Reward Models for Robust Language Model Alignment

Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we…

Computation and Language · Computer Science · 2025-08-06 · Anamika Lochab, Ruqi Zhang

CREAM: Consistency Regularized Self-Rewarding Language Models

Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve the alignment performance without the need for human annotations for preference data. These methods commonly utilize the same…

Machine Learning · Computer Science · 2025-04-29 · Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao

Towards Reliable, Uncertainty-Aware Alignment

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model…

Machine Learning · Computer Science · 2025-07-23 · Debangshu Banerjee, Kintan Saha, Aditya Gopalan

Fine-Tuning Language Models from Human Preferences

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments,…

Computation and Language · Computer Science · 2020-01-10 · Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and…

Computation and Language · Computer Science · 2025-07-16 · Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh

RewardAnything: Generalizable Principle-Following Reward Models

Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse…

Computation and Language · Computer Science · 2025-07-08 · Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed…

Machine Learning · Computer Science · 2024-11-06 · Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and…

Computation and Language · Computer Science · 2025-12-25 · Jiayi Zhou, Jiaming Ji, Juntao Dai, Dong Li, Yaodong Yang

Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback

We study estimation and statistical inference for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback (RLHF), where humans compare pairs of…

Machine Learning · Statistics · 2025-12-04 · Pangpang Liu, Junwei Lu, Will Wei Sun

Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs

Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL), demonstrating proficiency in mathematical reasoning and code generation. However, applying RL in broader domains like…

Computation and Language · Computer Science · 2025-02-10 · Hao Sun, Yunyi Shen, Jean-Francois Ton, Mihaela van der Schaar

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is…

Machine Learning · Computer Science · 2025-05-30 · Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora, making them powerful tools for various applications. To make LLMs more usable, aligning them with human preferences is essential.…

Computation and Language · Computer Science · 2024-10-21 · Mozhi Zhang, Pengyu Wang, Chenkun Tan, Mianqiu Huang, Dong Zhang, Yaqian Zhou, Xipeng Qiu

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. However, a major challenge arises from the sparsity of these reward signals - typically, there is only a single reward…

Computation and Language · Computer Science · 2024-02-20 · Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng

Aligning Crowd Feedback via Distributional Preference Reward Modeling

Deep Reinforcement Learning is widely used for aligning Large Language Models (LLMs) with human preferences. However, conventional reward modelling is predominantly dependent on human annotations provided by a select cohort of…

Artificial Intelligence · Computer Science · 2024-05-31 · Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu