Related papers: Enhancing LLM Steering through Sparse Autoencoder-…

SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper…

Computation and Language · Computer Science 2025-12-08 Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Controllable LLM Reasoning via Sparse Autoencoder-Based Steering

Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are…

Artificial Intelligence · Computer Science 2026-01-08 Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the…

Computation and Language · Computer Science 2025-07-31 Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Can sparse autoencoders be used to decompose and interpret steering vectors?

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering…

Machine Learning · Computer Science 2024-11-14 Harry Mayne, Yushi Yang, Adam Mahdi

Improving Steering Vectors by Targeting Sparse Autoencoder Features

To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier…

Machine Learning · Computer Science 2024-11-22 Sviatoslav Chalnev, Matthew Siu, Arthur Conmy

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large…

Computation and Language · Computer Science 2026-05-05 Seonglae Cho, Zekun Wu, Adriano Koshiyama

A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders…

Machine Learning · Computer Science 2025-02-18 Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Learning Retrieval Models with Sparse Autoencoders

Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned…

Machine Learning · Computer Science 2026-03-17 Thibault Formal, Maxime Louis, Hervé Dejean, Stéphane Clinchant

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned…

Robotics · Computer Science 2026-03-20 Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We…

Computation and Language · Computer Science 2025-08-07 Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets

SAEs Are Good for Steering -- If You Select the Right Features

Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept -…

Machine Learning · Computer Science 2025-12-23 Dana Arad, Aaron Mueller, Yonatan Belinkov

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure…

Machine Learning · Computer Science 2025-12-08 Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu