Related papers

Related papers: Programming Refusal with Conditional Activation St…

200 papers

Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers…

Computation and Language · Computer Science 2025-07-18 Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to wrong inferences in critical domains, where plausible arguments are incorrectly deemed logically…

Artificial Intelligence · Computer Science 2026-04-02 Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, André Freitas

Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to…

Computation and Language · Computer Science 2026-04-28 Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data…

Artificial Intelligence · Computer Science 2025-08-13 Shivam Dubey

Large language models (LLMs) require precise behavior control for safe and effective deployment across diverse applications. Activation steering offers a promising approach for LLMs' behavioral control. We focus on the question of how…

Artificial Intelligence · Computer Science 2026-01-13 Tetiana Bas, Krystian Novak

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an…

Machine Learning · Computer Science 2026-02-10 Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

The rise of large language models (LLMs) has prompted increasing interest in their use as in-context learning agents. At the core of agentic behavior is the capacity for exploration, or the ability to actively gather information about the…

Computation and Language · Computer Science 2024-10-14 Nate Rahn, Pierluca D'Oro, Marc G. Bellemare

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering…

Machine Learning · Computer Science 2025-01-29 Thomas Marshall, Adam Scherlis, Nora Belrose

Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for…

Computation and Language · Computer Science 2026-03-17 Amr Hegazy, Mostafa Elhoushi, Amr Alanwar

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing…

Computation and Language · Computer Science 2026-05-29 Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of…

Machine Learning · Computer Science 2026-03-10 Minjae Kang, Jaehyung Kim

We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection…

Computation and Language · Computer Science 2026-02-25 Iker García-Ferrero, David Montero, Roman Orus

Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but…

Machine Learning · Computer Science 2026-01-28 Quy-Anh Dang, Chris Ngo

Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise,…

Machine Learning · Computer Science 2025-06-06 Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating…

Artificial Intelligence · Computer Science 2026-05-11 Aayush Mishra, Daniel Khashabi, Anqi Liu

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially…

Machine Learning · Computer Science 2026-02-17 Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from…

Machine Learning · Computer Science 2026-03-09 Kartik Sharma, Rakshit S. Trivedi

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal…

Artificial Intelligence · Computer Science 2026-05-27 Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

This work examines whether activating latent subspaces in language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify…

Artificial Intelligence · Computer Science 2025-06-24 Vansh Sharma, Venkat Raman