Related papers: Token-level Direct Preference Optimization

200 papers

Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise…

Artificial Intelligence · Computer Science 2026-03-03 Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual…

Computation and Language · Computer Science 2026-05-27 Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing…

Computation and Language · Computer Science 2026-05-15 Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa D. Doan, Trung Le

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language…

Machine Learning · Computer Science 2025-06-18 Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment.…

Machine Learning · Computer Science 2023-09-29 Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science 2025-10-27 Weibin Liao, Xu Chu, Yasha Wang

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is…

Computation and Language · Computer Science 2025-04-16 Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, Meng Cao

Large Vision-Language Models (LVLMs) or multimodal large language models represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While…

Machine Learning · Computer Science 2025-09-09 Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers…

Computation and Language · Computer Science 2025-02-21 Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems…

Machine Learning · Computer Science 2025-02-03 Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy

The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also…

Artificial Intelligence · Computer Science 2024-10-25 Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, Zhijie Deng

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an…

Artificial Intelligence · Computer Science 2025-07-15 Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou, Cam-Tu Nguyen

Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement…

Computation and Language · Computer Science 2024-12-10 Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun

Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions, such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods,…

Machine Learning · Computer Science 2026-02-03 Mete Erdogan

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on…

Computation and Language · Computer Science 2026-04-30 Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon