Related papers: Are PPO-ed Language Models Hackable?

200 papers

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks. This paper investigates the feasibility of using PPO for direct reinforcement…

Computation and Language · Computer Science 2024-10-23 Alexander G. Padula, Dennis J. N. J. Soemers

Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator,…

Computation and Language · Computer Science 2024-07-08 Jane Pan, He He, Samuel R. Bowman, Shi Feng

We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a…

Computation and Language · Computer Science 2025-05-12 Lennart Stöpler, Rufat Asadli, Mitja Nikolaus, Ryan Cotterell, Alex Warstadt

Humans use language not just to convey information but also to express their inner feelings and mental states. In this work, we adapt the state-of-the-art language generation models to generate affective (emotional) text. We posit a model…

Computation and Language · Computer Science 2020-11-10 Ishika Singh, Ahsan Barkati, Tushar Goswamy, Ashutosh Modi

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus making it difficult to explain…

Computation and Language · Computer Science 2024-01-05 Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with…

Computation and Language · Computer Science 2025-10-27 Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao

Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we…

Machine Learning · Computer Science 2026-01-12 Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe

Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses,…

Computation and Language · Computer Science 2026-03-05 Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. While previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that…

Computation and Language · Computer Science 2025-03-04 Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is of the most…

Computation and Language · Computer Science 2023-11-06 Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when…

Computation and Language · Computer Science 2025-05-20 Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By…

Computation and Language · Computer Science 2026-03-05 Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping the impact of large language models (LLMs), contributing significantly to controlling output toxicity and selecting output styles, particularly as LLMs…

Artificial Intelligence · Computer Science 2023-08-11 Miao Fan, Chen Hu, Shuchang Zhou

Reinforcement Learning (RL) has shown its potential for training a dialogue policy agent to maximize the accumulated rewards given by users. However, the reward can be very sparse, as it is usually only provided at the end…

Computation and Language · Computer Science 2021-11-03 Hongru Wang, Huimin Wang, Zezhong Wang, Kam-Fai Wong

Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise…

Machine Learning · Computer Science 2026-05-05 Inoussa Mouiche

We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g.,…

Artificial Intelligence · Computer Science 2021-07-06 Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths

Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and…

Computation and Language · Computer Science 2025-09-08 Faruk Alpay, Taylan Alpay

While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One…

Machine Learning · Computer Science 2025-06-10 Chuheng Zhang, Wei Shen, Li Zhao, Xuyun Zhang, Xiaolong Xu, Wanchun Dou, Jiang Bian

Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE.…

Computation and Language · Computer Science 2025-09-29 Bokai Hu, Sai Ashish Somayajula, Xin Pan, Pengtao Xie