Related papers: Steered Generation via Gradient-Based Optimization…

Steered Generation via Gradient Descent on Sparse Features

Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal…

Computation and Language · Computer Science 2025-02-27 Sumanta Bhattacharyya, Pedram Rooshenas

SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper…

Computation and Language · Computer Science 2025-12-08 Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu

Steering Large Language Model Activations in Sparse Spaces

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior…

Machine Learning · Computer Science 2025-03-04 Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent

A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Controllable LLM Reasoning via Sparse Autoencoder-Based Steering

Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g. backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are…

Artificial Intelligence · Computer Science 2026-01-08 Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Interpretable LLM Guardrails via Sparse Representation Steering

Large language models (LLMs) exhibit impressive capabilities in generation tasks but are prone to producing harmful, misleading, or biased content, posing significant ethical and safety concerns. To mitigate such risks, representation…

Cryptography and Security · Computer Science 2025-11-17 Zeqing He, Zhibo Wang, Huiyu Xu, Hejun Lin, Wenhui Zhang, Zhixuan Chu

LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models

Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent…

Computation and Language · Computer Science 2025-01-23 Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders…

Machine Learning · Computer Science 2025-02-18 Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are…

Computation and Language · Computer Science 2025-10-17 Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

Efficient and accurate steering of Large Language Models through attention-guided feature learning

Steering, or direct manipulation of internal activations to guide LLM responses toward specific semantic concepts, is emerging as a promising avenue for both understanding how semantic concepts are stored within LLMs and advancing LLM…

Machine Learning · Computer Science 2026-02-03 Parmida Davarmanesh, Ashia Wilson, Adityanarayanan Radhakrishnan

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Improving Sparse Autoencoder with Dynamic Attention

Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for…

Machine Learning · Computer Science 2026-04-17 Dongsheng Wang, Jinsen Zhang, Dawei Su, Hui Huang

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large…

Computation and Language · Computer Science 2026-05-05 Seonglae Cho, Zekun Wu, Adriano Koshiyama

Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often…

Computation and Language · Computer Science 2025-06-04 Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda