Related papers: Representation Tuning

Improving Instruction-Following in Language Models through Activation Steering

The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models…

Computation and Language · Computer Science 2025-04-15 Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

Improving Reasoning Performance in Large Language Models via Representation Engineering

Recent advancements in large language models (LLMs) have resulted in increasingly anthropomorphic language concerning the ability of LLMs to reason. Whether reasoning in LLMs should be understood to be inherently different is, however,…

Machine Learning · Computer Science 2025-07-28 Bertram Højer, Oliver Jarvis, Stefan Heinrich

Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of…

Computation and Language · Computer Science 2025-05-20 Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues on LLMs, which could be…

Artificial Intelligence · Computer Science 2025-06-06 Kai Wang, Yihao Zhang, Meng Sun

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that…

Machine Learning · Computer Science 2026-04-22 Julian Skifstad, Xinyue Annie Yang, Glen Chou

Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering

The field of large language models (LLMs) has grown rapidly in recent years, driven by the desire for better efficiency, interpretability, and safe use. Building on the novel approach of "activation engineering," this study explores…

Computation and Language · Computer Science 2025-08-26 Rumi Allbert, James K. Wiles, Vlad Grankovsky

Style Vectors for Steering Generative Large Language Models

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text…

Computation and Language · Computer Science 2024-02-05 Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking

Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering

Large language models have transformed AI, yet reliably controlling their outputs remains a challenge. This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at…

Neural and Evolutionary Computing · Computer Science 2025-05-13 Joris Postmus, Steven Abreu

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs

Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipelines for these methods have primarily relied on subjective demonstrations, instead of quantitative,…

Artificial Intelligence · Computer Science 2024-10-23 Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger

Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLMs

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data…

Artificial Intelligence · Computer Science 2025-08-13 Shivam Dubey

Conversations: Love Them, Hate Them, Steer Them

Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output…

Computation and Language · Computer Science 2025-11-25 Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka

Steering LLMs for Formal Theorem Proving

Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for…

Machine Learning · Computer Science 2025-10-14 Shashank Kirtania, Arun Iyer

Improving Multilingual Language Models by Aligning Representations through Steering

This paper investigates how Large Language Models (LLMs) represent non-English tokens -- a question that remains underexplored despite recent progress. We propose a lightweight intervention method using representation steering, where a…

Computation and Language · Computer Science 2025-08-27 Omar Mahmoud, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana

A Timeline and Analysis for Representation Plasticity in Large Language Models

The ability to steer AI behavior is crucial to preventing its long term dangerous and catastrophic potential. Representation Engineering (RepE) has emerged as a novel, powerful method to steer internal model behaviors, such as "honesty", at…

Machine Learning · Computer Science 2024-10-10 Akshat Kannan

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works -- specifically, what internal mechanisms steering vectors affect…

Machine Learning · Computer Science 2026-04-10 Stephen Cheng, Sarah Wiegreffe, Dinesh Manocha

LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models

Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent…

Computation and Language · Computer Science 2025-01-23 Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng

Representation Bending for Large Language Model Safety

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by the recent…

Machine Learning · Computer Science 2025-07-16 Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

Representational Curvature Modulates Behavioral Uncertainty in Large Language Models

In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input…

Artificial Intelligence · Computer Science 2026-04-28 Jack King, Evelina Fedorenko, Eghbal A. Hosseini

Patterns and Mechanisms of Contrastive Activation Engineering

Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive…

Artificial Intelligence · Computer Science 2025-05-07 Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali

The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. Activation steering provides a lightweight alternative to prompt engineering…

Artificial Intelligence · Computer Science 2026-01-30 Diaoulé Diallo, Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking