Related papers: Programming Refusal with Conditional Activation St…

Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers…

Computation and Language · Computer Science 2025-07-18 Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to wrong inferences in critical domains, where plausible arguments are incorrectly deemed logically…

Artificial Intelligence · Computer Science 2026-04-02 Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, André Freitas

Contextual Linear Activation Steering of Language Models

Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to…

Computation and Language · Computer Science 2026-04-28 Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin

Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data…

Artificial Intelligence · Computer Science 2025-08-13 Shivam Dubey

What Can We Actually Steer? A Multi-Behavior Study of Activation Control

Large language models (LLMs) require precise behavior control for safe and effective deployment across diverse applications. Activation steering offers a promising approach for LLMs' behavioral control. We focus on the question of how…

Artificial Intelligence · Computer Science 2026-01-13 Tetiana Bas, Krystian Novak

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint

As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an…

Machine Learning · Computer Science 2026-02-10 Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

Controlling Large Language Model Agents with Entropic Activation Steering

The rise of large language models (LLMs) has prompted increasing interest in their use as in-context learning agents. At the core of agentic behavior is the capacity for exploration, or the ability to actively gather information about the…

Computation and Language · Computer Science 2024-10-14 Nate Rahn, Pierluca D'Oro, Marc G. Bellemare

Refusal in LLMs is an Affine Function

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering…

Machine Learning · Computer Science 2025-01-29 Thomas Marshall, Adam Scherlis, Nora Belrose

Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for…

Computation and Language · Computer Science 2026-03-17 Amr Hegazy, Mostafa Elhoushi, Amr Alanwar

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing…

Computation and Language · Computer Science 2026-05-29 Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Jiaxi Yang, Shicheng Liu, Yuchen Yang, Dongwon Lee

Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but have a potential risk of…

Machine Learning · Computer Science 2026-03-10 Minjae Kang, Jaehyung Kim

Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics

We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models' refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection…

Computation and Language · Computer Science 2026-02-25 Iker García-Ferrero, David Montero, Roman Orus

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but…

Machine Learning · Computer Science 2026-01-28 Quy-Anh Dang, Chris Ngo

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise,…

Machine Learning · Computer Science 2025-06-06 Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien

Steered LLM Activations are Non-Surjective

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating…

Artificial Intelligence · Computer Science 2026-05-11 Aayush Mishra, Daniel Khashabi, Anqi Liu

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially…

Machine Learning · Computer Science 2026-02-17 Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from…

Machine Learning · Computer Science 2026-03-09 Kartik Sharma, Rakshit S. Trivedi

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal…

Artificial Intelligence · Computer Science 2026-05-27 Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

Steering Conceptual Bias via Transformer Latent-Subspace Activation

This work examines whether activating latent subspaces in language models (LLMs) can steer scientific code generation toward a specific programming language. Five causal LLMs were first evaluated on scientific coding prompts to quantify…

Artificial Intelligence · Computer Science 2025-06-24 Vansh Sharma, Venkat Raman