Related papers: InfAlign: Inference-aware language model alignment

Inference Time Alignment with Reward-Guided Tree Search

Inference-time computation methods enhance the performance of Large Language Models (LLMs) by leveraging additional computational resources to achieve superior results. Common techniques, such as Best-of-N sampling, Majority Voting, and…

Computation and Language · Computer Science · 2024-11-27 · Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

Solving the Inverse Alignment Problem for Efficient RLHF

Collecting high-quality preference datasets for reinforcement learning from human feedback (RLHF) is resource-intensive and challenging. As a result, researchers often train reward models on extensive offline datasets which aggregate…

Machine Learning · Computer Science · 2024-12-17 · Shambhavi Krishna, Aishwarya Sahoo

Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility…

Machine Learning · Computer Science · 2026-02-04 · Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

Towards Reliable Alignment: Uncertainty-aware RLHF

Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models…

Artificial Intelligence · Computer Science · 2024-11-01 · Debangshu Banerjee, Aditya Gopalan

Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment

Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a reference model and the one with the highest predicted reward according to a learned reward…

Machine Learning · Computer Science · 2026-03-09 · Ved Sriraman, Adam Block

Theoretical Limits of Language Model Alignment

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward…

Machine Learning · Computer Science · 2026-05-11 · Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli

AdaBoN: Adaptive Best-of-N Alignment

Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be…

Computation and Language · Computer Science · 2026-03-16 · Vinod Raman, Hilal Asi, Satyen Kale

Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

Many imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with the demonstration. However, the inferred reward functions often fail to capture the underlying task objectives. In…

Machine Learning · Computer Science · 2024-11-01 · Weichao Zhou, Wenchao Li

ARGS: Alignment as Reward-Guided Search

Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided…

Computation and Language · Computer Science · 2024-02-06 · Maxim Khanov, Jirayu Burapacheep, Yixuan Li

HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

Diffusion model alignment aims to bridge the gap between generated outputs and human preferences by enhancing both semantic consistency with textual prompts and overall visual quality. Existing alignment methods face a challenging…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-01 · Xin Xie, Jiaxian Guo, Dong Gong

HAF-RM: A Hybrid Alignment Framework for Reward Model Training

The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing researchers focus on enhancing reward models through data improvements, following the…

Computation and Language · Computer Science · 2025-01-09 · Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue, Zengfeng Huang, Xuanjing Huang, Zhongyu Wei

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Inference-time computation offers a powerful axis for scaling the performance of language models. However, naively increasing computation in techniques like Best-of-N sampling can lead to performance degradation due to reward hacking.…

Artificial Intelligence · Computer Science · 2025-04-09 · Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Akshay Krishnamurthy, Dylan J. Foster

ALaRM: Align Language Models via Hierarchical Rewards Modeling

We introduce ALaRM, the first framework modeling hierarchical rewards in reinforcement learning from human feedback (RLHF), which is designed to enhance the alignment of large language models (LLMs) with human preferences. The framework…

Computation and Language · Computer Science · 2024-03-19 · Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, Zhongyu Wei

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

With the rapid development of large language models (LLMs), they are not only used as general-purpose AI assistants but are also customized through further fine-tuning to meet the requirements of different applications. A pivotal factor in…

Computation and Language · Computer Science · 2024-01-23 · Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu

Language Model Alignment with Elastic Reset

Finetuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a…

Computation and Language · Computer Science · 2023-12-14 · Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville

Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment

Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. However, such approaches critically rely on the assumption that proxy reward models accurately reflect intended supervision, a…

Computation and Language · Computer Science · 2026-01-21 · Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang, Xinru Liu

Compute Aligned Training: Optimizing for Test Time Inference

Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the…

Machine Learning · Computer Science · 2026-05-21 · Adam Ousherovitch, Ambuj Tewari

Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data,…

Machine Learning · Computer Science · 2024-10-23 · Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science · 2024-12-03 · Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how…

Computation and Language · Computer Science · 2025-06-03 · Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi