Related papers: Token-level Direct Preference Optimization

Token-Importance Guided Direct Preference Optimization

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise…

Artificial Intelligence · Computer Science 2026-03-03 Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

Token-weighted Direct Preference Optimization with Attention

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual…

Computation and Language · Computer Science 2026-05-27 Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing…

Computation and Language · Computer Science 2026-05-15 Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le

A Survey of Direct Preference Optimization

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Machine Learning · Computer Science 2025-03-18 Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao

TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language…

Machine Learning · Computer Science 2025-06-18 Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science 2025-10-27 Weibin Liao, Xu Chu, Yasha Wang

TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is…

Computation and Language · Computer Science 2025-04-16 Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao

Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While…

Machine Learning · Computer Science 2025-09-09 Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers…

Computation and Language · Computer Science 2025-02-21 Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems…

Machine Learning · Computer Science 2025-02-03 Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy

Aligning CodeLLMs with Direct Preference Optimization

The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also…

Artificial Intelligence · Computer Science 2024-10-25 Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, Zhijie Deng

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Multi-Reference Preference Optimization for Large Language Models

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou, Cam-Tu Nguyen

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement…

Computation and Language · Computer Science 2024-12-10 Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun

Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models

Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions, such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods,…

Machine Learning · Computer Science 2026-02-03 Mete Erdogan

TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on…

Computation and Language · Computer Science 2026-04-30 Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon