Related papers: Stepwise Alignment for Constrained Language Model …

Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment

When fine-tuning pre-trained Language Models (LMs) to exhibit desired behaviors, maintaining control over risk is critical for ensuring both safety and trustworthiness. Most existing safety alignment methods, such as Safe RLHF and SACPO,…

Artificial Intelligence · Computer Science · 2026-01-01 · Lijun Zhang, Lin Li, Wei Wei, Yajie Qi, Huizhong Song, Jun Wang, Yaodong Yang, Jiye Liang

MPO: Multilingual Safety Alignment via Reward Gap Optimization

Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning…

Computation and Language · Computer Science · 2025-05-23 · Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach

Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has…

Computation and Language · Computer Science · 2025-03-31 · Xuying Li, Zhuo Li, Yuji Kosuga, Victor Bian

Learning to Align Human Code Preferences

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human…

Software Engineering · Computer Science · 2025-12-09 · Xin Yin, Chao Ni, Xiaohu Yang

Provably Convergent Primal-Dual DPO for Constrained LLM Alignment

The widespread application of large language models (LLMs) raises increasing demands on ensuring safety or imposing constraints, such as reducing harmful content and adhering to predefined rules. While there have been several works studying…

Machine Learning · Computer Science · 2026-02-13 · Yihan Du, Seo Taek Kong, R. Srikant

SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling

Human preference alignment is critical in building powerful and reliable large language models (LLMs). However, current methods either ignore the multi-dimensionality of human preferences (e.g. helpfulness and harmlessness) or struggle with…

Machine Learning · Computer Science · 2024-10-14 · Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, Kaiqi Huang

Step-level Value Preference Optimization for Mathematical Reasoning

Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the…

Computation and Language · Computer Science · 2024-09-30 · Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan

Enhancing LLM Safety via Constrained Direct Preference Optimization

The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these…

Machine Learning · Computer Science · 2024-03-06 · Zixuan Liu, Xiaolin Sun, Zizhan Zheng

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces…

Machine Learning · Computer Science · 2025-07-22 · Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while…

Machine Learning · Computer Science · 2025-02-28 · Xiyue Peng, Hengquan Guo, Jiawei Zhang, Dongqing Zou, Ziyu Shao, Honghao Wei, Xin Liu

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from…

Machine Learning · Computer Science · 2026-03-05 · Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, Moontae Lee

Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant…

Computation and Language · Computer Science · 2025-07-03 · Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, Qing He

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large…

Computation and Language · Computer Science · 2026-04-21 · Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment

Direct Preference Optimization (DPO) is broadly utilized for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that the capability of DPO to generate…

Machine Learning · Computer Science · 2025-05-20 · Wenqiao Zhu, Ji Liu, Lulu Wang, Jun Wu, Yulun Zhang

ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning

Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong…

Computation and Language · Computer Science · 2025-05-27 · Yeyuan Wang, Dehong Gao, Rujiao Long, Lei Yi, Linbo Jin, Libin Yang, Xiaoyan Cai

Towards Efficient Exact Optimization of Language Model Alignment

The alignment of language models with human preferences is vital for their application in real-world tasks. The problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences with…

Computation and Language · Computer Science · 2024-06-06 · Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, Minlie Huang

Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple…

Machine Learning · Computer Science · 2026-03-24 · Wenwen Si, Sooyong Jang, Insup Lee, Osbert Bastani

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when…

Computation and Language · Computer Science · 2025-05-20 · Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

sDPO: Don't Use Your Data All at Once

As development of large language models (LLM) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO)…

Computation and Language · Computer Science · 2024-10-08 · Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park

CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of…

Computation and Language · Computer Science · 2025-01-24 · Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat