InfAlign: Inference-aware language model alignment

Subjects: Machine Learning; Computation and Language; Information Theory (math.IT). Version: v5, 2025-08-22.

Abstract

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Increasingly, however, language models are decoded with inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal when such inference-time methods are used. To address this, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates the calibrate-and-transform RL (InfAlign-CTRL) algorithm, which consists of a reward calibration step followed by a KL-regularized reward maximization step applied to a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations that improve inference-time win rates by up to 3-8%. Finally, we show that our proposed reward calibration method is also a strong baseline for optimizing the standard win rate.
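The abstract gives no implementation details, so the following is only a minimal sketch of the calibrate-then-transform idea under stated assumptions. Calibration is assumed here to be the empirical CDF of a response's reward among base-policy samples for the same prompt. The exponential transform and its parameter `t` are hypothetical placeholders, not the transformations the paper proposes, and the KL-regularized RL step is omitted entirely. All function names are illustrative.

```python
import bisect
import math


def calibrate(reward, base_rewards):
    """Assumed calibration: fraction of base-policy rewards <= `reward`.

    `base_rewards` are rewards of responses sampled from the base model
    for the same prompt. The result lies in [0, 1].
    """
    ordered = sorted(base_rewards)
    return bisect.bisect_right(ordered, reward) / len(ordered)


def transform(calibrated, t=4.0):
    """Hypothetical monotone transform of the calibrated reward.

    The exponential form and the value of `t` are placeholders chosen
    for illustration only.
    """
    return math.exp(t * calibrated)


def best_of_n(candidates, reward_fn):
    """Best-of-N decoding: return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


def training_reward(reward, base_rewards, t=4.0):
    """Reward that a KL-regularized RL step would maximize in this sketch."""
    return transform(calibrate(reward, base_rewards), t)
```

One property of this assumed calibration is that it depends only on the rank of a reward among the base-policy rewards, so any strictly increasing rescaling of the raw reward model leaves the calibrated value unchanged.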

Cite

@article{arxiv.2412.19792,
  title  = {InfAlign: Inference-aware language model alignment},
  author = {Ananth Balashankar and Ziteng Sun and Jonathan Berant and Jacob Eisenstein and Michael Collins and Adrian Hutter and Jong Lee and Chirag Nagpal and Flavien Prost and Aradhana Sinha and Ananda Theertha Suresh and Ahmad Beirami},
  journal= {arXiv preprint arXiv:2412.19792},
  year   = {2025}
}