Related papers: Are PPO-ed Language Models Hackable?

Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks. This paper investigates the feasibility of using PPO for direct reinforcement…

Computation and Language · Computer Science 2024-10-23 Alexander G. Padula, Dennis J. N. J. Soemers

Spontaneous Reward Hacking in Iterative Self-Refinement

Language models are capable of iteratively improving their outputs based on natural language feedback, thus enabling in-context optimization of user preference. In place of human users, a second language model can be used as an evaluator,…

Computation and Language · Computer Science 2024-07-08 Jane Pan, He He, Samuel R. Bowman, Shi Feng

Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models

We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a…

Computation and Language · Computer Science 2025-05-12 Lennart Stöpler, Rufat Asadli, Mitja Nikolaus, Ryan Cotterell, Alex Warstadt

Adapting a Language Model for Controlled Affective Text Generation

Humans use language not just to convey information but also to express their inner feelings and mental states. In this work, we adapt the state-of-the-art language generation models to generate affective (emotional) text. We posit a model…

Computation and Language · Computer Science 2020-11-10 Ishika Singh, Ahsan Barkati, Tushar Goswamy, Ashutosh Modi

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become "aligned", thus making it difficult to explain…

Computation and Language · Computer Science 2024-01-05 Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with…

Computation and Language · Computer Science 2025-10-27 Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao

MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems

Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we…

Machine Learning · Computer Science 2026-01-12 Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe

Monitoring Emergent Reward Hacking During Generation via Internal Activations

Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses,…

Computation and Language · Computer Science 2026-03-05 Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

Taming Overconfidence in LLMs: Reward Calibration in RLHF

Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. While previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that…

Computation and Language · Computer Science 2025-03-04 Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is of the most…

Computation and Language · Computer Science 2023-11-06 Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing…

Machine Learning · Computer Science 2024-07-31 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when…

Computation and Language · Computer Science 2025-05-20 Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By…

Computation and Language · Computer Science 2026-03-05 Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length

Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping the impact of large language models (LLMs), contributing significantly to controlling output toxicity and selecting output styles, particularly as LLMs…

Artificial Intelligence · Computer Science 2023-08-11 Miao Fan, Chen Hu, Shuchang Zhou

Integrating Pretrained Language Model for Dialogue Policy Learning

Reinforcement Learning (RL) has shown its potential for training a dialogue policy agent to maximize the accumulated rewards given by users. However, the reward can be very sparse, as it is usually only provided at the end…

Computation and Language · Computer Science 2021-11-03 Hongru Wang, Huimin Wang, Zezhong Wang, Kam-Fai Wong

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise…

Machine Learning · Computer Science 2026-05-05 Inoussa Mouiche

Learning Rewards from Linguistic Feedback

We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g.,…

Artificial Intelligence · Computer Science 2021-07-06 Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths

Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and…

Computation and Language · Computer Science 2025-09-08 Faruk Alpay, Taylan Alpay

Policy Filtration for RLHF to Mitigate Noise in Reward Models

While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One…

Machine Learning · Computer Science 2025-06-10 Chuheng Zhang, Wei Shen, Li Zhao, Xuyun Zhang, Xiaolong Xu, Wanchun Dou, Jiang Bian

Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE.…

Computation and Language · Computer Science 2025-09-29 Bokai Hu, Sai Ashish Somayajula, Xin Pan, Pengtao Xie