Related papers: ORPO: Monolithic Preference Optimization without R…

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large…

Machine Learning · Computer Science 2024-09-18 Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

Small-Margin Preferences Still Matter - If You Train Them Right

Preference optimization methods such as DPO align large language models (LLMs) using paired comparisons, but their effectiveness can be highly sensitive to the quality and difficulty of preference pairs. A common heuristic treats…

Artificial Intelligence · Computer Science 2026-02-03 Jinlong Pang, Zhaowei Zhu, Na Di, Yichi Zhang, Yaxuan Wang, Chen Qian, Yang Liu

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward…

Computation and Language · Computer Science 2025-07-29 Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp

M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of human annotation required for effective fine-tuning and…

Computation and Language · Computer Science 2025-08-19 Ruirui Gao, Emily Johnson, Bowen Tan, Yanfei Qian

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, their effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful.…

Computation and Language · Computer Science 2025-11-18 Ruibo Deng, Duanyu Feng, Wenqiang Lei

An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only,…

Computation and Language · Computer Science 2026-03-23 Yuming Feng, Christy Yang

ROPO: Robust Preference Optimization for Large Language Models

Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data.…

Machine Learning · Computer Science 2024-05-29 Xize Liang, Chao Chen, Shuang Qiu, Jie Wang, Yue Wu, Zhihang Fu, Zhihao Shi, Feng Wu, Jieping Ye

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science 2025-12-09 Xin Yin, Chao Ni, Xiaohu Yang

Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Alignment, endowing a pre-trained Large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling…

Computation and Language · Computer Science 2024-12-18 Yuchen Fan, Yuzhong Hong, Qiushi Wang, Junwei Bao, Hongfei Jiang, Yang Song

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods…

Computation and Language · Computer Science 2025-07-01 Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee

Offline Preference Optimization via Maximum Marginal Likelihood Estimation

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that…

Machine Learning · Computer Science 2026-01-27 Saeed Najafi, Alona Fyshe

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on improving the safety and helpfulness of the OPT-350M language…

Computation and Language · Computer Science 2025-09-12 Piyush Pant

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To…

Computation and Language · Computer Science 2024-06-03 Yueqin Yin, Zhendong Wang, Yujia Xie, Weizhu Chen, Mingyuan Zhou

Semi-Supervised Preference Optimization with Limited Feedback

The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data,…

Machine Learning · Computer Science 2026-02-20 Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In…

Machine Learning · Computer Science 2025-05-19 Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

BinaryPPO: Efficient Policy Optimization for Binary Classification

Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise,…

Machine Learning · Computer Science 2026-02-04 Punya Syon Pandey, Zhijing Jin

MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples

Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are…

Computation and Language · Computer Science 2024-12-23 Shuo Xie, Fangzhi Zhu, Jiahui Wang, Lulu Wen, Wei Dai, Xiaowei Chen, Junxiong Zhu, Kai Zhou, Bo Zheng

InfoPO: On Mutual Information Maximization for Large Language Model Alignment

We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward…

Machine Learning · Computer Science 2025-05-14 Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul Chilimbi

RosePO: Aligning LLM-based Recommenders with Human Values

Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for recommendation systems, which usually adapt a pre-trained LLM to the recommendation scenario through supervised fine-tuning (SFT). However, both the…

Information Retrieval · Computer Science 2024-10-17 Jiayi Liao, Xiangnan He, Ruobing Xie, Jiancan Wu, Yancheng Yuan, Xingwu Sun, Zhanhui Kang, Xiang Wang

PIPA: Preference Alignment as Prior-Informed Statistical Estimation

Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been…

Machine Learning · Computer Science 2025-07-28 Junbo Li, Zhangyang Wang, Qiang Liu