Related papers: AIPO: Improving Training Objective for Iterative P…

200 papers

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference…

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing.…

Computation and Language · Computer Science 2025-07-18 Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online…

Computation and Language · Computer Science 2024-06-18 Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is…

Machine Learning · Computer Science 2026-05-18 Yue Wang, Qizhou Wang, Zizhuo Zhang, Gang Niu, Bo Han, Masashi Sugiyama

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces…

Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from…

Machine Learning · Computer Science 2026-02-17 Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the…

Computation and Language · Computer Science 2025-01-09 Hritik Bansal, Ashima Suvarna, Gantavya Bhatt, Nanyun Peng, Kai-Wei Chang, Aditya Grover

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However,…

Computation and Language · Computer Science 2024-06-04 Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these…

Recently, there has been significant interest in replacing the reward model in Reinforcement Learning with Human Feedback (RLHF) methods for Large Language Models (LLMs), such as Direct Preference Optimization (DPO) and its variants. These…

Computation and Language · Computer Science 2024-09-27 Jian Li, Haojing Huang, Yujia Zhang, Pengfei Xu, Xi Chen, Rui Song, Lida Shi, Jingwen Wang, Hao Xu

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science 2025-02-19 Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from…

Computation and Language · Computer Science 2024-09-30 Cheolhun Jang

Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant…

Computation and Language · Computer Science 2025-03-21 Shivank Garg, Ayush Singh, Shweta Singh, Paras Chopra

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns…

Computation and Language · Computer Science 2025-01-23 Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng

Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions.…

Computation and Language · Computer Science 2025-05-30 Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu

How can Large Language Models (LLMs) be aligned with human intentions and values? A typical solution is to gather human preference on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a…

Computation and Language · Computer Science 2024-05-28 Hung Le, Quan Tran, Dung Nguyen, Kien Do, Saloni Mittal, Kelechi Ogueji, Svetha Venkatesh

Direct Preference Optimization (DPO) has been widely adopted for large language model alignment due to its simple training procedure and lack of an explicit reward model. However, in iterative DPO, when the policy model from the previous…

Information Retrieval · Computer Science 2026-05-25 Lingling Fu, Yongfu Xu