Related papers: Likelihood hacking in probabilistic program synthe…

200 papers

Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during…

Malicious software packages in open-source ecosystems, such as PyPI, pose growing security risks. Unlike traditional vulnerabilities, these packages are intentionally designed to deceive users, making detection challenging due to evolving…

Software Engineering · Computer Science 2025-04-21 Motunrayo Ibiyo, Thinakone Louangdy, Phuong T. Nguyen, Claudio Di Sipio, Davide Di Ruscio

Safe Reinforcement learning (Safe RL) aims at learning optimal policies while staying safe. A popular solution to Safe RL is shielding, which uses a logical safety specification to prevent an RL agent from taking unsafe actions. However,…

Artificial Intelligence · Computer Science 2023-03-07 Wen-Chi Yang, Giuseppe Marra, Gavin Rens, Luc De Raedt

Probabilistic programming offers a powerful framework for modeling uncertainty, yet statistical model discovery in this domain entails navigating an immense search space under strict domain-specific constraints. When small language models…

Machine Learning · Computer Science 2026-04-21 Madhav Kanda, Shubham Ugare, Sasa Misailovic

As Large Language Models (LLMs) become increasingly embedded in empirical research workflows, their use as analytical tools for quantitative or qualitative data raises pressing concerns for scientific integrity. This opinion paper draws a…

Human-Computer Interaction · Computer Science 2025-08-12 Thomas Kosch, Sebastian Feger

Numerous algorithms have been proposed to $\textit{align}$ language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various…

Computation and Language · Computer Science 2024-06-06 Suraj Anand, David Getzen

The remarkable capabilities of Large Language Models (LLMs) in natural language understanding and generation have sparked interest in their potential for cybersecurity applications, including password guessing. In this study, we conduct an…

Cryptography and Security · Computer Science 2026-01-01 Mohammad Abdul Rehman, Syed Imad Ali Shah, Abbas Anwar, Noor Islam, Hamid Khan

Many of today's probabilistic programming languages (PPLs) have brittle inference performance: the performance of the underlying inference algorithm is very sensitive to the precise way in which the probabilistic program is written. A…

Artificial Intelligence · Computer Science 2023-02-22 Ellie Y. Cheng, Todd Millstein, Guy Van den Broeck, Steven Holtzen

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities,…

Computation and Language · Computer Science 2023-09-20 Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu

Probabilistic programming languages allow programmers to write down conditional probability distributions that represent statistical and machine learning models as programs that use observe statements. These programs are run by accumulating…

Programming Languages · Computer Science 2021-01-25 Jules Jacobs

This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on \emph{fixed…

Machine Learning · Computer Science 2025-03-11 Dhawal Gupta, Adam Fisch, Christoph Dann, Alekh Agarwal

Likelihood training and maximization-based decoding result in dull and repetitive generated texts even when using powerful language models (Holtzman et al., 2019). Adding a loss function for regularization was shown to improve text…

Computation and Language · Computer Science 2021-01-13 Evgeny Lagutin, Daniil Gavrilov, Pavel Kalaidin

Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at…

Computation and Language · Computer Science 2024-12-10 Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng

This paper targets the efficient construction of a safety shield for decision making in scenarios that incorporate uncertainty. Markov decision processes (MDPs) are prominent models to capture such planning problems. Reinforcement learning…

Artificial Intelligence · Computer Science 2019-11-26 Nils Jansen, Bettina Könighofer, Sebastian Junges, Alexandru C. Serban, Roderick Bloem

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional…

Computation and Language · Computer Science 2025-06-10 Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn

Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of…

Artificial Intelligence · Computer Science 2025-11-26 Dominik Wagner, Leon Witzman, Luke Ong

Reward hacking--where agents exploit flaws in imperfect reward functions rather than performing tasks as intended--poses risks for AI alignment. Reward hacking has been observed in real training runs, with coding agents learning to…

Artificial Intelligence · Computer Science 2025-08-26 Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic…

Probabilistic programming languages (PPLs) allow programmers to construct statistical models and then simulate data or perform inference over them. Many PPLs restrict models to a particular instance of simulation or inference, limiting…

Programming Languages · Computer Science 2024-12-24 Minh Nguyen, Roly Perera, Meng Wang, Nicolas Wu

Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing…

Machine Learning · Computer Science 2025-08-19 Michael Bereket, Jure Leskovec