English
Related papers

Related papers: MPO: An Efficient Post-Processing Framework for Mi…

200 papers

Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, as they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that…

Computation and Language · Computer Science 2025-01-23 Qi Gou , Cam-Tu Nguyen

A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences. Recent approaches therefore prefer customization, gathering multi-dimensional feedback,…

Machine Learning · Computer Science 2024-08-20 Zhanhui Zhou , Jie Liu , Jing Shao , Xiangyu Yue , Chao Yang , Wanli Ouyang , Yu Qiao

Aligning Large Language Models (LLMs) with human feedback is crucial for their development. Existing preference optimization methods such as DPO and KTO, while improved based on Reinforcement Learning from Human Feedback (RLHF), are…

Computation and Language · Computer Science 2024-12-23 Shuo Xie , Fangzhi Zhu , Jiahui Wang , Lulu Wen , Wei Dai , Xiaowei Chen , Junxiong Zhu , Kai Zhou , Bo Zheng

For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood…

Artificial Intelligence · Computer Science 2025-05-27 Anirudhan Badrinath , Prabhat Agarwal , Jiajing Xu

Post-training of LLMs with RLHF, and subsequently preference optimization algorithms such as DPO, IPO, etc., made a big difference in improving human alignment. However, all such techniques can only work with a single (human) objective. In…

Machine Learning · Computer Science 2025-05-19 Akhil Agnihotri , Rahul Jain , Deepak Ramachandran , Zheng Wen

Preference learning is a key technology for aligning language models with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm to optimize preference learning, which first fits a reward model for…

Machine Learning · Computer Science 2024-03-26 Zaifan Jiang , Xing Huang , Chao Wei

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant…

Computation and Language · Computer Science 2025-07-03 Chengao Li , Hanyu Zhang , Yunkun Xu , Hongyan Xue , Xiang Ao , Qing He

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley-Terry assumption struggle to capture the…

Artificial Intelligence · Computer Science 2026-04-08 Fang Wu , Xu Huang , Weihao Xuan , Zhiwei Zhang , Yijia Xiao , Guancheng Wan , Xiaomin Li , Bing Hu , Peng Xia , Jure Leskovec , Yejin Choi

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into…

Artificial Intelligence · Computer Science 2024-12-03 Chenliang Li , Siliang Zeng , Zeyi Liao , Jiaxiang Li , Dongyeop Kang , Alfredo Garcia , Mingyi Hong

Reinforcement Learning from Human Feedback (RLHF) has been proven to be an effective method for preference alignment of large language models (LLMs) and is widely used in the post-training process of LLMs. However, RLHF struggles with…

Computation and Language · Computer Science 2024-11-05 Dongxu Liu , Bing Xu , Yinzhuo Chen , Bufan Xu , Wenpeng Lu , Muyun Yang , Tiejun Zhao

Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas…

Machine Learning · Computer Science 2025-02-20 Shicong Cen , Jincheng Mei , Katayoon Goshvadi , Hanjun Dai , Tong Yang , Sherry Yang , Dale Schuurmans , Yuejie Chi , Bo Dai

Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning model, particularly Language Model (LM) with human preferences. There are typically multiple objectives driving the…

Machine Learning · Computer Science 2025-02-25 Nuoya Xiong , Aarti Singh

The alignment of large language models with human values presents a critical challenge, particularly when balancing conflicting objectives like helpfulness and harmlessness. Existing approaches, such as Reinforcement Learning from Human…

Computation and Language · Computer Science 2025-03-04 Yuxuan Liu

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent…

While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study Reinforcement…

Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However,…

Machine Learning · Computer Science 2024-12-17 Avinandan Bose , Zhihan Xiong , Aadirupa Saha , Simon Shaolei Du , Maryam Fazel

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from…

Machine Learning · Computer Science 2026-02-11 Yuxuan Tang , Yifan Feng

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Machine Learning · Computer Science 2025-10-21 Keertana Chidambaram , Karthik Vinay Seetharaman , Vasilis Syrgkanis

Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with…

Artificial Intelligence · Computer Science 2025-10-20 Keertana Chidambaram , Karthik Vinary Seetharaman , Vasilis Syrgkanis

Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback…

‹ Prev 1 2 3 10 Next ›