Related papers: Iterative Reasoning Preference Optimization

Building Math Agents with Multi-Turn Iterative Preference Learning

Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning.…

Machine Learning · Computer Science 2025-03-03 Wei Xiong, Chengshuai Shi, Jiaming Shen, Aviv Rosenberg, Zhen Qin, Daniele Calandriello, Misha Khalman, Rishabh Joshi, Bilal Piot, Mohammad Saleh, Chi Jin, Tong Zhang, Tianqi Liu

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

PORT: Preference Optimization on Reasoning Traces

Preference optimization methods have been successfully applied to improve not only the alignment of large language models (LLMs) with human values, but also specific natural language tasks such as summarization and stylistic continuations.…

Machine Learning · Computer Science 2025-02-06 Salem Lahlou, Abdalgader Abubaker, Hakim Hacid

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online…

Computation and Language · Computer Science 2024-06-18 Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, Wanli Ouyang

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful…

Computation and Language · Computer Science 2024-07-26 Tianduo Wang, Shichen Li, Wei Lu

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address…

Machine Learning · Computer Science 2024-06-28 Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair…

Computation and Language · Computer Science 2026-04-13 Chia-Hsuan Lee, Mingyang Zhou, Renkun Ni, Zelei Cheng, Sihui Dai, Supriyo Chakraborty, Shixiong Zhang, Sambit Sahu, William Campbell

Preference Optimization for Reasoning with Pseudo Feedback

Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically…

Computation and Language · Computer Science 2025-02-17 Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data.…

Computation and Language · Computer Science 2026-04-20 Junyi Li, Yongqiang Chen, Ningning Ding

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte…

Artificial Intelligence · Computer Science 2024-06-19 Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, Michael Shieh

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu

Patience Is The Key to Large Language Model Reasoning

Recent advancements in the field of large language models, particularly through the Chain of Thought (CoT) approach, have demonstrated significant improvements in solving complex problems. However, existing models either tend to sacrifice…

Computation and Language · Computer Science 2025-12-30 Yijiong Yu

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over…

Artificial Intelligence · Computer Science 2026-04-23 Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu

Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding the hallucination and flaws in their…

Artificial Intelligence · Computer Science 2024-10-16 Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty

Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science 2025-02-19 Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

Plug-and-Play Training Framework for Preference Optimization

Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) in a wide range of tasks including dialogue and question-answering. However, current methods fail to account for the varying difficulty…

Computation and Language · Computer Science 2024-12-31 Jingyuan Ma, Rui Li, Zheng Li, Lei Sha, Zhifang Sui

Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for…

Computation and Language · Computer Science 2026-02-03 Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Chuheng Zhang, Min Xu

Thinking Preference Optimization

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning…

Machine Learning · Computer Science 2025-02-20 Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han

Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains

The class of direct preference optimization (DPO) algorithms has emerged as a promising approach for solving the alignment problem in foundation models. These algorithms work with very limited feedback in the form of pairwise preferences…

Machine Learning · Computer Science 2026-02-03 Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, Mohammad Ghavamzadeh

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi