Related papers: Adaptive Decoding via Latent Preference Optimizati…

Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models

Recently, Large Language Models (LLMs) have shown impressive abilities in code generation. However, existing LLMs' decoding strategies are designed for Natural Language (NL) generation, overlooking the differences between NL and programming…

Software Engineering · Computer Science · 2023-12-29 · Yuqi Zhu, Jia Li, Ge Li, YunFei Zhao, Jia Li, Zhi Jin, Hong Mei

Learning Adaptive LLM Decoding

Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We…

Machine Learning · Computer Science · 2026-03-17 · Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai

Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

Temperature is a crucial hyperparameter in large language models (LLMs), controlling the trade-off between exploration and exploitation during text generation. High temperatures encourage diverse but noisy outputs, while low temperatures…

Machine Learning · Computer Science · 2026-02-13 · Haoran Dang, Cuiling Lan, Hai Wan, Xibin Zhao, Yan Lu

Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs

Diversity is an essential metric for evaluating the creativity of outputs generated by language models. Temperature-based sampling is a common strategy to increase diversity. However, for tasks that require high precision, e.g.,…

Machine Learning · Computer Science · 2025-10-03 · Sergey Troshin, Wafaa Mohammed, Yan Meng, Christof Monz, Antske Fokkens, Vlad Niculae

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose…

Machine Learning · Computer Science · 2025-08-26 · Rui Wang, Qianguo Sun, Chao Song, Junlong Wu, Tianrong Chen, Zhiyun Zeng, Yu Li

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science · 2025-12-09 · Xin Yin, Chao Ni, Xiaohu Yang

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement…

Machine Learning · Computer Science · 2026-05-21 · Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Contextual Temperature for Language Modeling

Temperature scaling has been widely used as an effective approach to control the smoothness of a distribution, which helps the model performance in various tasks. Current practices to apply temperature scaling assume either a fixed, or a…

Computation and Language · Computer Science · 2020-12-29 · Pei-Hsin Wang, Sheng-Iou Hsieh, Shih-Chieh Chang, Yu-Ting Chen, Jia-Yu Pan, Wei Wei, Da-Chang Juan

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of…

Computation and Language · Computer Science · 2025-01-24 · Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Optimizing Temperature for Language Models with Multi-Sample Inference

Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is…

Machine Learning · Computer Science · 2025-06-17 · Weihua Du, Yiming Yang, Sean Welleck

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science · 2024-07-31 · Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns…

Computation and Language · Computer Science · 2025-01-23 · Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science · 2024-05-29 · Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

The End of Manual Decoding: Towards Truly End-to-End Language Models

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel…

Computation and Language · Computer Science · 2025-11-03 · Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang

Temperature-Centric Investigation of Speculative Decoding with Knowledge Distillation

Speculative decoding stands as a pivotal technique to expedite inference in autoregressive (large) language models. This method employs a smaller draft model to speculate a block of tokens, which the target model then evaluates for…

Computation and Language · Computer Science · 2024-10-15 · Siru Ouyang, Shuohang Wang, Minhao Jiang, Ming Zhong, Donghan Yu, Jiawei Han, Yelong Shen

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers…

Computation and Language · Computer Science · 2025-02-21 · Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

Personalized LLM Decoding via Contrasting Personal Preference

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and…

Computation and Language · Computer Science · 2025-11-25 · Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim

Teaching Your Models to Understand Code via Focal Preference Alignment

Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success…

Computation and Language · Computer Science · 2025-10-10 · Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li

Diverse Preference Optimization

Post-training of language models, either through reinforcement learning, preference optimization or supervised finetuning, tends to sharpen the output probability distribution and reduce the diversity of generated responses. This is…

Computation and Language · Computer Science · 2025-05-23 · Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, Ilia Kulikov