Related papers: The Differences Between Direct Alignment Algorithm…

AlphaPO: Reward Shape Matters for LLM Alignment

Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment…

Computation and Language · Computer Science 2025-06-02 Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh Pillai, S. Sathiya Keerthi

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning…

Computation and Language · Computer Science 2026-04-17 Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values.…

Machine Learning · Computer Science 2025-06-12 Phuc Minh Nguyen, Ngoc-Hieu Nguyen, Duy H. M. Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan

SeRA: Self-Reviewing and Alignment of Large Language Models using Implicit Reward Margins

Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the…

Machine Learning · Computer Science 2024-10-15 Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan

Understanding Likelihood Over-optimisation in Direct Alignment Algorithms

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as…

Computation and Language · Computer Science 2024-10-21 Zhengyan Shi, Sander Land, Acyr Locatelli, Matthieu Geist, Max Bartolo

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs), however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained…

Machine Learning · Computer Science 2024-11-06 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

The role of reinforcement learning (RL) in enhancing the reasoning of large language models (LLMs) is becoming increasingly significant. Despite the success of RL in many scenarios, there are still many challenges in improving the reasoning…

Artificial Intelligence · Computer Science 2024-12-25 Jiacai Liu, Chaojie Wang, Chris Yuhao Liu, Liang Zeng, Rui Yan, Yiwen Sun, Yang Liu, Yahui Zhou

Direct Advantage Regression: Aligning LLMs with Online AI Reward

Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference in aligning language models (LLMs). However, the straightforward replacement of humans with AI…

Artificial Intelligence · Computer Science 2025-04-22 Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large…

Machine Learning · Computer Science 2024-09-18 Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang

Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?

The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences.…

Artificial Intelligence · Computer Science 2026-01-28 Zetian Sun, Dongfang Li, Xuhui Chen, Baotian Hu, Min Zhang

Listwise Direct Preference Optimization with Multi-Dimensional Preference Mixing

Recent alignment methods based on Direct Preference Optimization (DPO) reformulate preference learning as supervised optimization over pairwise comparisons, offering improved efficiency and stability over reinforcement learning from human…

Machine Learning · Computer Science 2026-01-22 Yuhui Sun, Xiyao Wang, Zixi Li, YiTian Ding, Tianyang Ling, Jialuo Chen, Tianyi Yu, Zhenlong Yuan, Jinman Zhao

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical…

Machine Learning · Computer Science 2025-07-08 Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu

Why DPO is a Misspecified Estimator and How to Fix It

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO…

Machine Learning · Computer Science 2025-10-24 Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee

A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton

In recent years, a variety of gradient-based first-order methods have been developed to solve bi-level optimization problems for learning applications. However, theoretical guarantees of these existing approaches heavily rely on the…

Machine Learning · Computer Science 2020-07-03 Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, Jin Zhang

Direct Language Model Alignment from Online AI Feedback

Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference…

Artificial Intelligence · Computer Science 2024-03-04 Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead…

Computation and Language · Computer Science 2026-04-28 Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks

This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3)…

Computation and Language · Computer Science 2025-02-11 Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science 2025-05-13 Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems…

Machine Learning · Computer Science 2025-02-03 Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy