Related papers: Thinking Preference Optimization

ReFT: Reasoning with Reinforced Fine-Tuning

One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability,…

Computation and Language · Computer Science · 2024-12-16 · Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning…

Computation and Language · Computer Science · 2025-05-22 · Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences post pre-training. While SFT excels in efficiency and PO in effectiveness, they are often combined…

Computation and Language · Computer Science · 2025-07-15 · Ermo Hua, Biqing Qi, Kaiyan Zhang, Kai Tian, Xingtai Lv, Ning Ding, Bowen Zhou

SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression

Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning)…

Machine Learning · Computer Science · 2025-08-19 · Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address…

Machine Learning · Computer Science · 2024-06-28 · Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful…

Computation and Language · Computer Science · 2024-07-26 · Tianduo Wang, Shichen Li, Wei Lu

ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large…

Machine Learning · Computer Science · 2024-09-18 · Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking,…

Artificial Intelligence · Computer Science · 2026-04-16 · Bin Hong, Jiayu Liu, Kai Zhang, Jianwen Sun, Mengdi Zhang, Zhenya Huang

Preference-Oriented Supervised Fine-Tuning: Favoring Target Model Over Aligned Large Language Models

Alignment, endowing a pre-trained large language model (LLM) with the ability to follow instructions, is crucial for its real-world applications. Conventional supervised fine-tuning (SFT) methods formalize it as causal language modeling…

Computation and Language · Computer Science · 2024-12-18 · Yuchen Fan, Yuzhong Hong, Qiushi Wang, Junwei Bao, Hongfei Jiang, Yang Song

Training Large Language Models To Reason In Parallel With Global Forking Tokens

Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse…

Computation and Language · Computer Science · 2026-03-03 · Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always…

Computation and Language · Computer Science · 2024-11-01 · Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with…

Computation and Language · Computer Science · 2025-10-27 · Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However,…

Computation and Language · Computer Science · 2026-02-26 · Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain…

Computation and Language · Computer Science · 2025-10-27 · Weibin Liao, Xu Chu, Yasha Wang

LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization

Logical reasoning is a key task for artificial intelligence due to its role in major downstream tasks such as Question Answering and Summarization. Recent methods in improving the reasoning ability of LLMs fall short in correctly converting a…

Machine Learning · Computer Science · 2025-06-24 · Koushik Viswanadha, Deepanway Ghosal, Somak Aditya

Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science · 2025-02-19 · Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

With the rapid advancement of large language model (LLM) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor…

Artificial Intelligence · Computer Science · 2026-01-09 · Chang Zhao, Zheming Yang, Yunqing Hu, Qi Guo, Zijian Wang, Pengcheng Li, Wen Ji

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their…

Computation and Language · Computer Science · 2025-10-06 · Pranjal Aggarwal, Sean Welleck

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science · 2025-07-29 · Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution…

Computation and Language · Computer Science · 2025-12-15 · Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta, Roberto Pieraccini