English
Related papers

Related papers: Iterative Reasoning Preference Optimization

200 papers

Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning.…

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu , Jiahao Lin , Xiangyu Tian , Qichao Zhang , Linjing Li , Yuqian Fu , Nan Xu , Wei He , Xiangyuan Lan , Dongmei Jiang , Dongbin Zhao

Preference optimization methods have been successfully applied to improve not only the alignment of large language models (LLMs) with human values, but also specific natural language tasks such as summarization and stylistic continuations.…

Machine Learning · Computer Science 2025-02-06 Salem Lahlou , Abdalgader Abubaker , Hakim Hacid

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online…

Computation and Language · Computer Science 2024-06-18 Jie Liu , Zhanhui Zhou , Jiaheng Liu , Xingyuan Bu , Chao Yang , Han-Sen Zhong , Wanli Ouyang

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful…

Computation and Language · Computer Science 2024-07-26 Tianduo Wang , Shichen Li , Wei Lu

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address…

Machine Learning · Computer Science 2024-06-28 Xin Lai , Zhuotao Tian , Yukang Chen , Senqiao Yang , Xiangru Peng , Jiaya Jia

Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair…

Computation and Language · Computer Science 2026-04-13 Chia-Hsuan Lee , Mingyang Zhou , Renkun Ni , Zelei Cheng , Sihui Dai , Supriyo Chakraborty , Shixiong Zhang , Sambit Sahu , William Campbell

Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically…

Computation and Language · Computer Science 2025-02-17 Fangkai Jiao , Geyang Guo , Xingxing Zhang , Nancy F. Chen , Shafiq Joty , Furu Wei

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data.…

Computation and Language · Computer Science 2026-04-20 Junyi Li , Yongqiang Chen , Ningning Ding

We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte…

Artificial Intelligence · Computer Science 2024-06-19 Yuxi Xie , Anirudh Goyal , Wenyue Zheng , Min-Yen Kan , Timothy P. Lillicrap , Kenji Kawaguchi , Michael Shieh

Direct Preference Optimization (DPO) have emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through…

Computation and Language · Computer Science 2025-06-05 Lin Sun , Chuang Liu , Peng Liu , Bingyang Li , Weijia Lu , Ning Wu

Recent advancements in the field of large language models, particularly through the Chain of Thought (CoT) approach, have demonstrated significant improvements in solving complex problems. However, existing models either tend to sacrifice…

Computation and Language · Computer Science 2025-12-30 Yijiong Yu

Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over…

Artificial Intelligence · Computer Science 2026-04-23 Darsh Kachroo , Adriana Caraeni , Arjun Prasaath Anbazhagan , Brennan Lagasse , Kevin Zhu

Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding the hallucination and flaws in their…

Artificial Intelligence · Computer Science 2024-10-16 Fangkai Jiao , Chengwei Qin , Zhengyuan Liu , Nancy F. Chen , Shafiq Joty

Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome…

Computation and Language · Computer Science 2025-02-19 Amir Saeidi , Shivanshu Verma , Aswin RRV , Kashif Rasul , Chitta Baral

Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) in wide tasks including dialogue and question-answering. However, current methods fail to account for the varying difficulty…

Computation and Language · Computer Science 2024-12-31 Jingyuan Ma , Rui Li , Zheng Li , Lei Sha , Zhifang Sui

Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for…

Computation and Language · Computer Science 2026-02-03 Junjie Lu , Yuliang Liu , Chaofeng Qu , Wei Shen , Zhouhan Lin , Chuheng Zhang , Min Xu

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning…

Machine Learning · Computer Science 2025-02-20 Wang Yang , Hongye Jin , Jingfeng Yang , Vipin Chaudhary , Xiaotian Han

The class of direct preference optimization (DPO) algorithms has emerged as a promising approach for solving the alignment problem in foundation models. These algorithms work with very limited feedback in the form of pairwise preferences…

Machine Learning · Computer Science 2026-02-03 Luca Viano , Ruida Zhou , Yifan Sun , Mahdi Namazifar , Volkan Cevher , Shoham Sabach , Mohammad Ghavamzadeh

Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from…

Machine Learning · Computer Science 2024-11-12 Zhuotong Chen , Fang Liu , Jennifer Zhu , Wanyu Du , Yanjun Qi
‹ Prev 1 2 3 10 Next ›