Related papers: Selective Preference Optimization via Token-Level …

200 papers

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual…

Computation and Language · Computer Science 2026-05-27 Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language…

Machine Learning · Computer Science 2025-06-18 Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise…

Artificial Intelligence · Computer Science 2026-03-03 Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

Direct Preference Optimization is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle…

Computation and Language · Computer Science 2026-03-03 Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree

Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the…

Computation and Language · Computer Science 2024-09-02 Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang

Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However,…

Computation and Language · Computer Science 2025-05-27 Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng

Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within…

Computation and Language · Computer Science 2025-07-11 Zhijin Dong

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing…

Computation and Language · Computer Science 2026-05-15 Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le

Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong…

Computation and Language · Computer Science 2025-05-27 Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

The next token prediction loss is the dominant self-supervised training objective for large language models and has achieved promising results in a variety of downstream tasks. However, upon closer investigation of this objective, we find…

Computation and Language · Computer Science 2025-02-25 Zhili Feng, Dhananjay Ram, Cole Hawkins, Aditya Rawal, Jinman Zhao, Sheng Zha

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is…

Computation and Language · Computer Science 2025-04-16 Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers…

Computation and Language · Computer Science 2025-02-21 Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science 2025-10-27 Weibin Liao, Xu Chu, Yasha Wang

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g. helpfulness and harmlessness) or struggle with…

Machine Learning · Computer Science 2024-10-14 Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang

Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the…

Computation and Language · Computer Science 2024-09-30 Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods…

Computation and Language · Computer Science 2026-04-15 Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv