Related papers: BinaryPPO: Efficient Policy Optimization for Binar…

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with…

Computation and Language · Computer Science · 2025-10-27 · Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao

Does Fine-tuning by Reinforcement Learning Improve Generalization in Binary Speech Deepfake Detection?

Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-03-04 · Xin Wang, Ge Wanying, Junichi Yamagishi

Rethinking the Trust Region in LLM Reinforcement Learning

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio…

Machine Learning · Computer Science · 2026-05-27 · Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such…

Machine Learning · Computer Science · 2026-03-03 · Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Lifan Yuan, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large…

Machine Learning · Computer Science · 2024-09-18 · Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

Binary Classifier Optimization for Large Language Model Alignment

In real-world services such as ChatGPT, aligning models based on user feedback is crucial for improving model performance. However, due to the simplicity and convenience of providing feedback, users typically offer only basic binary…

Machine Learning · Computer Science · 2025-06-10 · Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, Kyoung-Woon On

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical…

Machine Learning · Computer Science · 2025-07-08 · Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu

Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning

Post-training paradigms for Large Language Models (LLMs), primarily Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), face a fundamental dilemma: SFT provides stability (low variance) but suffers from high fitting bias, while RL…

Machine Learning · Computer Science · 2026-04-13 · Taojie Zhu, Dongyang Xu, Ding Zou, Sen Zhao, Qiaobo Hao, Zhiguo Yang, Yonghong He

Assessing Robustness to Spurious Correlations in Post-Training Language Models

Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations --…

Computation and Language · Computer Science · 2025-05-12 · Julia Shuieh, Prasann Singhal, Apaar Shanker, John Heyer, George Pu, Samuel Denton

Thinking Preference Optimization

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning…

Machine Learning · Computer Science · 2025-02-20 · Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han

RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast,…

Machine Learning · Computer Science · 2026-02-12 · Linxuan Xia, Xiaolong Yang, Yongyuan Chen, Enyue Zhao, Deng Cai, Yasheng Wang, Boxi Wu

Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data

Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative…

Computation and Language · Computer Science · 2025-07-25 · Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang

Proximal Supervised Fine-Tuning

Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy…

Machine Learning · Computer Science · 2026-04-14 · Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu

Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization

Existing post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement learning (RL) methods; the former is stable during training but suffers from limited generalization, while the latter, despite its…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-02 · Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo

Enhancing Blind Face Restoration through Online Reinforcement Learning

Blind Face Restoration (BFR) encounters inherent challenges in exploring its large solution space, leading to common artifacts like missing details and identity ambiguity in the restored images. To tackle these challenges, we propose a…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-22 · Bin Wu, Yahui Liu, Chi Zhang, Yao Zhao, Wei Wang

Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Preference Optimization

Large Language Model (LLM) fine tuning is underutilized in the field of medicine. Two of the most common methods of fine tuning are Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO), but there is little guidance…

Computation and Language · Computer Science · 2024-12-16 · Thomas Savage, Stephen Ma, Abdessalem Boukil, Vishwesh Patel, Ekanath Rangan, Ivan Lopez, Jonathan H Chen

Mathematical Framework for Custom Reward Functions in Job Application Evaluation using Reinforcement Learning

Most of the traditional Applicant Tracking Systems (ATS) depend on strict matching using keywords, where candidates that are highly qualified are many times disqualified because of minor semantic differences. In this article, the two-stage…

Machine Learning · Computer Science · 2026-01-21 · Shreyansh Jain, Madhav Singhvi, Shreya Rahul Jain, Pranav S, Dishaa Lokesh, Naren Chittibabu, Akash Anandhan

On-Policy Supervised Fine-Tuning for Efficient Reasoning

Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly…

Artificial Intelligence · Computer Science · 2026-02-17 · Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen

ReFT: Reasoning with Reinforced Fine-Tuning

One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability,…

Computation and Language · Computer Science · 2024-12-16 · Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a…

Machine Learning · Computer Science · 2024-12-05 · Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang