Related papers: Likelihood hacking in probabilistic program synthe…

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during…

Machine Learning · Computer Science 2026-05-01 Eyon Jang, Damon Falck, Joschka Braun, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner

Detecting Malicious Source Code in PyPI Packages with LLMs: Does RAG Come in Handy?

Malicious software packages in open-source ecosystems, such as PyPI, pose growing security risks. Unlike traditional vulnerabilities, these packages are intentionally designed to deceive users, making detection challenging due to evolving…

Software Engineering · Computer Science 2025-04-21 Motunrayo Ibiyo, Thinakone Louangdy, Phuong T. Nguyen, Claudio Di Sipio, Davide Di Ruscio

Safe Reinforcement Learning via Probabilistic Logic Shields

Safe reinforcement learning (Safe RL) aims to learn optimal policies while staying safe. A popular solution to Safe RL is shielding, which uses a logical safety specification to prevent an RL agent from taking unsafe actions. However,…

Artificial Intelligence · Computer Science 2023-03-07 Wen-Chi Yang, Giuseppe Marra, Gavin Rens, Luc De Raedt

RefineStat: Efficient Exploration for Probabilistic Program Synthesis

Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models…

Machine Learning · Computer Science 2026-04-21 Madhav Kanda, Shubham Ugare, Sasa Misailovic

Prompt-Hacking: The New p-Hacking?

As Large Language Models (LLMs) become increasingly embedded in empirical research workflows, their use as analytical tools for quantitative or qualitative data raises pressing concerns for scientific integrity. This opinion paper draws a…

Human-Computer Interaction · Computer Science 2025-08-12 Thomas Kosch, Sebastian Feger

Are PPO-ed Language Models Hackable?

Numerous algorithms have been proposed to align language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various…

Computation and Language · Computer Science 2024-06-06 Suraj Anand, David Getzen

When Intelligence Fails: An Empirical Study on Why LLMs Struggle with Password Cracking

The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential for cybersecurity applications, including password guessing. In this study, we conduct an…

Cryptography and Security · Computer Science 2026-01-01 Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, Noor Islam, Hamid Khan

flip-hoisting: Exploiting Repeated Parameters in Discrete Probabilistic Programs

Many of today's probabilistic programming languages (PPLs) have brittle inference performance: the performance of the underlying inference algorithm is very sensitive to the precise way in which the probabilistic program is written. A…

Artificial Intelligence · Computer Science 2023-02-22 Ellie Y. Cheng, Todd Millstein, Guy Van den Broeck, Steven Holtzen

Stabilizing RLHF through Advantage Model and Selective Rehearsal

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities,…

Computation and Language · Computer Science 2023-09-20 Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu

Paradoxes of Probabilistic Programming

Probabilistic programming languages allow programmers to write down conditional probability distributions that represent statistical and machine learning models as programs that use observe statements. These programs are run by accumulating…

Programming Languages · Computer Science 2021-01-25 Jules Jacobs

Mitigating Preference Hacking in Policy Optimization with Pessimism

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on fixed…

Machine Learning · Computer Science 2025-03-11 Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal

Implicit Unlikelihood Training: Improving Neural Text Generation with Reinforcement Learning

Likelihood training and maximization-based decoding result in dull and repetitive generated texts even when using powerful language models (Holtzman et al., 2019). Adding a loss function for regularization was shown to improve text…

Computation and Language · Computer Science 2021-01-13 Evgeny Lagutin, Daniil Gavrilov, Pavel Kalaidin

Language Models Learn to Mislead Humans via RLHF

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at…

Computation and Language · Computer Science 2024-12-10 Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng

Safe Reinforcement Learning via Probabilistic Shields

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorporate uncertainty. Markov decision processes (MDPs) are prominent models to capture such planning problems. Reinforcement learning…

Artificial Intelligence · Computer Science 2019-11-26 Nils Jansen, Bettina Könighofer, Sebastian Junges, Alexandru C. Serban, Roderick Bloem

Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional…

Computation and Language · Computer Science 2025-06-10 Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn

Reinforcement Learning with $\omega$-Regular Objectives and Constraints

Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of…

Artificial Intelligence · Computer Science 2025-11-26 Dominik Wagner, Leon Witzman, Luke Ong

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs

Reward hacking, where agents exploit flaws in imperfect reward functions rather than performing tasks as intended, poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to…

Artificial Intelligence · Computer Science 2025-08-26 Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans

Natural Emergent Misalignment from Reward Hacking in Production RL

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic…

Artificial Intelligence · Computer Science 2025-11-25 Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, Evan Hubinger

Modular Probabilistic Models via Algebraic Effects

Probabilistic programming languages (PPLs) allow programmers to construct statistical models and then simulate data or perform inference over them. Many PPLs restrict models to a particular instance of simulation or inference, limiting…

Programming Languages · Computer Science 2024-12-24 Minh Nguyen, Roly Perera, Meng Wang, Nicolas Wu

Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing…

Machine Learning · Computer Science 2025-08-19 Michael Bereket, Jure Leskovec