Related papers: Binary Classifier Optimization for Large Language …

Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a…

Computation and Language · Computer Science 2026-05-12 Xilai Ma, Liye Zhao, Weijun Yao, Haibing Di, Wenya Wang, Jing Li

As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online…

Computation and Language · Computer Science 2024-10-28 Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, Anh Tuan Luu

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization

Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the generation of the winning…

Computation and Language · Computer Science 2025-02-19 Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is…

Computation and Language · Computer Science 2025-06-06 Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them; that is, the alignment problem. To make…

Computation and Language · Computer Science 2024-06-24 Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang

BinaryPPO: Efficient Policy Optimization for Binary Classification

Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise,…

Machine Learning · Computer Science 2026-02-04 Punya Syon Pandey, Zhijing Jin

Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from…

Machine Learning · Computer Science 2026-02-11 Yuxuan Tang, Yifan Feng

CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences

Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have…

Computation and Language · Computer Science 2025-11-12 Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date. Despite their success, existing methods…

Computation and Language · Computer Science 2025-07-01 Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee

Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward…

Machine Learning · Computer Science 2024-12-20 Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However,…

Computation and Language · Computer Science 2024-06-04 Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Tianhao Hu, Peixin Cao, Nan Du, Xiaolong Li

Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier

For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood…

Artificial Intelligence · Computer Science 2025-05-27 Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences…

Computation and Language · Computer Science 2024-05-29 Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou

Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective at capturing relative preferences, these methods are widely observed to suppress…

Computation and Language · Computer Science 2025-12-04 Kaiyang Guo, Yinchuan Li, Zhitang Chen

InfoPO: On Mutual Information Maximization for Large Language Model Alignment

We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward…

Machine Learning · Computer Science 2025-05-14 Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul Chilimbi

Adaptive Preference Optimization with Uncertainty-aware Utility Anchor

Offline preference optimization methods are efficient for large language model (LLM) alignment. Direct Preference Optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling.…

Machine Learning · Computer Science 2026-05-26 Xiaobo Wang, Zixia Jia, Jiaqi Li, Qi Liu, Zilong Zheng

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou, Cam-Tu Nguyen

Bootstrapping Language Models with DPO Implicit Rewards

Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human…

Computation and Language · Computer Science 2025-03-10 Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of…

Computation and Language · Computer Science 2025-01-24 Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat