Related papers: AMPO: Active Multi-Preference Optimization for Sel…

AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models

Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models…

Machine Learning · Computer Science 2025-06-10 Qi Liu, Jingqing Ruan, Hao Li, Haodong Zhao, Desheng Wang, Jiansong Chen, Wan Guanglu, Xunliang Cai, Zhi Zheng, Tong Xu

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However,…

Computation and Language · Computer Science 2024-06-04 Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, its effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful.…

Computation and Language · Computer Science 2025-11-18 Ruibo Deng, Duanyu Feng, Wenqiang Lei

CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences

Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have…

Computation and Language · Computer Science 2025-11-12 Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces…

Machine Learning · Computer Science 2025-07-22 Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To…

Computation and Language · Computer Science 2024-06-03 Yueqin Yin, Zhendong Wang, Yujia Xie, Weizhu Chen, Mingyuan Zhou

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However,…

Machine Learning · Computer Science 2026-05-18 Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong…

Computation and Language · Computer Science 2025-05-27 Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces…

Machine Learning · Computer Science 2025-02-11 Shengyang Sun, Yian Zhang, Alexander Bukharin, David Mosallanezhad, Jiaqi Zeng, Soumye Singhal, Gerald Shen, Adithya Renduchintala, Tugrul Konuk, Yi Dong, Zhilin Wang, Dmitry Chichkov, Olivier Delalleau, Oleksii Kuchaiev

$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization

While astonishingly capable, large language models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information.…

Artificial Intelligence · Computer Science 2024-10-30 Long Tan Le, Han Shu, Tung-Anh Nguyen, Choong Seon Hong, Nguyen H. Tran

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference…

Machine Learning · Computer Science 2026-05-11 Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from…

Machine Learning · Computer Science 2026-02-11 Yuxuan Tang, Yifan Feng

Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

Direct Preference Optimization (DPO) has become a popular approach for aligning language models using pairwise preferences. However, in practical post-training pipelines, on-policy generation typically yields multiple candidate responses…

Machine Learning · Computer Science 2025-06-23 Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Nagarajan Natarajan, Chetan Bansal, Saravan Rajmohan

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In…

Machine Learning · Computer Science 2025-05-19 Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods…

Computation and Language · Computer Science 2025-07-01 Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are…

Computation and Language · Computer Science 2024-12-23 Shuo Xie, Fangzhi Zhu, Jiahui Wang, Lulu Wen, Wei Dai, Xiaowei Chen, Junxiong Zhu, Kai Zhou, Bo Zheng

Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining

The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality…

Machine Learning · Computer Science 2025-10-10 Chenxi Liu, Tianyi Xiong, Yanshuo Chen, Ruibo Chen, Yihan Wu, Junfeng Guo, Tianyi Zhou, Heng Huang

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is…

Computation and Language · Computer Science 2025-06-06 Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang