
Related papers: The Differences Between Direct Alignment Algorithm…

200 papers

Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment…

Computation and Language · Computer Science 2025-06-02 Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agrawal, Natesh Pillai, S. Sathiya Keerthi

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning…

Computation and Language · Computer Science 2026-04-17 Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang

Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values.…

Machine Learning · Computer Science 2025-06-12 Phuc Minh Nguyen, Ngoc-Hieu Nguyen, Duy H. M. Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan

Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives for Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the…

Machine Learning · Computer Science 2024-10-15 Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh, Sailik Sengupta, Sravan Bodapati, Aram Galstyan

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), have emerged as alternatives to online Reinforcement Learning from Human Feedback (RLHF) algorithms such as…

Computation and Language · Computer Science 2024-10-21 Zhengyan Shi, Sander Land, Acyr Locatelli, Matthieu Geist, Max Bartolo

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs), however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained…

Machine Learning · Computer Science 2024-11-06 Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

The role of reinforcement learning (RL) in enhancing the reasoning of large language models (LLMs) is becoming increasingly significant. Despite the success of RL in many scenarios, there are still many challenges in improving the reasoning…

Artificial Intelligence · Computer Science 2024-12-25 Jiacai Liu, Chaojie Wang, Chris Yuhao Liu, Liang Zeng, Rui Yan, Yiwen Sun, Yang Liu, Yahui Zhou

Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF) by utilizing online AI preference in aligning language models (LLMs). However, the straightforward replacement of humans with AI…

Artificial Intelligence · Computer Science 2025-04-22 Li He, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large…

Machine Learning · Computer Science 2024-09-18 Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang

The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences.…

Artificial Intelligence · Computer Science 2026-01-28 Zetian Sun, Dongfang Li, Xuhui Chen, Baotian Hu, Min Zhang

Recent alignment methods based on Direct Preference Optimization (DPO) reformulate preference learning as supervised optimization over pairwise comparisons, offering improved efficiency and stability over reinforcement learning from human…

Machine Learning · Computer Science 2026-01-22 Yuhui Sun, Xiyao Wang, Zixi Li, YiTian Ding, Tianyang Ling, Jialuo Chen, Tianyi Yu, Zhenlong Yuan, Jinman Zhao

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical…

Machine Learning · Computer Science 2025-07-08 Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO…

Machine Learning · Computer Science 2025-10-24 Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee

In recent years, a variety of gradient-based first-order methods have been developed to solve bi-level optimization problems for learning applications. However, theoretical guarantees of these existing approaches heavily rely on the…

Machine Learning · Computer Science 2020-07-03 Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, Jin Zhang

Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference…

Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead…

Computation and Language · Computer Science 2026-04-28 Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu

This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3)…

Computation and Language · Computer Science 2025-02-11 Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, Chitta Baral

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often…

Machine Learning · Computer Science 2025-05-13 Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang

Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems…

Machine Learning · Computer Science 2025-02-03 Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy