Related papers: AIPO: Improving Training Objective for Iterative P…

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference…

Machine Learning · Computer Science 2026-05-11 Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing.…

Computation and Language · Computer Science 2025-07-18 Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online…

Computation and Language · Computer Science 2024-06-18 Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang

What Is Preference Optimization Doing, and Why?

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is…

Machine Learning · Computer Science 2026-05-18 Yue Wang, Qizhou Wang, Zizhuo Zhang, Gang Niu, Bo Han, Masashi Sugiyama

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces…

Machine Learning · Computer Science 2025-02-11 Shengyang Sun, Yian Zhang, Alexander Bukharin, David Mosallanezhad, Jiaqi Zeng, Soumye Singhal, Gerald Shen, Adithya Renduchintala, Tugrul Konuk, Yi Dong, Zhilin Wang, Dmitry Chichkov, Olivier Delalleau, Oleksii Kuchaiev

Less is More: Improving LLM Alignment via Preference Data Selection

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from…

Machine Learning · Computer Science 2026-02-17 Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the…

Computation and Language · Computer Science 2025-01-09 Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However,…

Computation and Language · Computer Science 2024-06-04 Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

Truncated Proximal Policy Optimization

Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these…

Artificial Intelligence · Computer Science 2025-06-19 Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science 2025-02-19 Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult

Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from…

Computation and Language · Computer Science 2024-09-30 Cheolhun Jang

IPO: Your Language Model is Secretly a Preference Classifier

Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant…

Computation and Language · Computer Science 2025-03-21 Shivank Garg, Ayush Singh, Shweta Singh, Paras Chopra

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns…

Computation and Language · Computer Science 2025-01-23 Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

Length-Controlled Margin-Based Preference Optimization without Reference Model

Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions.…

Computation and Language · Computer Science 2025-05-30 Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu

Multi-Reference Preference Optimization for Large Language Models

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

TPMM-DPO: Trajectory-aware Preference-guided Model Merging for Iterative Direct Preference Optimization

Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous…

Information Retrieval · Computer Science 2026-05-25 Lingling Fu, Yongfu Xu