Related papers: Selective Preference Optimization via Token-Level …

Token-weighted Direct Preference Optimization with Attention

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual…

Computation and Language · Computer Science 2026-05-27 Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language…

Machine Learning · Computer Science 2025-06-18 Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

Token-Importance Guided Direct Preference Optimization

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise…

Artificial Intelligence · Computer Science 2026-03-03 Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle…

Computation and Language · Computer Science 2026-03-03 Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

Token-level Direct Preference Optimization

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the…

Computation and Language · Computer Science 2024-09-02 Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization

Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However,…

Computation and Language · Computer Science 2025-05-27 Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng

Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within…

Computation and Language · Computer Science 2025-07-11 Zhijin Dong

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing…

Computation and Language · Computer Science 2026-05-15 Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le

ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong…

Computation and Language · Computer Science 2025-05-27 Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

Sequence-level Large Language Model Training with Contrastive Preference Optimization

The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks. However, upon closer investigation of this objective, we find…

Computation and Language · Computer Science 2025-02-25 Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng

TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is…

Computation and Language · Computer Science 2025-04-16 Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers…

Computation and Language · Computer Science 2025-02-21 Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science 2025-10-27 Weibin Liao, Xu Chu, Yasha Wang

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g. helpfulness and harmlessness) or struggle with…

Machine Learning · Computer Science 2024-10-14 Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang

Step-level Value Preference Optimization for Mathematical Reasoning

Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the…

Computation and Language · Computer Science 2024-09-30 Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods…

Computation and Language · Computer Science 2026-04-15 Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv