Related papers: Improving Sparse Autoencoder with Dynamic Attentio…

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Machine Learning · Computer Science 2025-06-06 Nikita Balagansky , Yaroslav Aksenov , Daniil Laptev , Vadim Kurochkin , Gleb Gerasimov , Nikita Koryagin , Daniil Gavrilov

BatchTopK Sparse Autoencoders

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, that uses a fixed number of the most active…

Machine Learning · Computer Science 2024-12-10 Bart Bussmann , Patrick Leask , Neel Nanda

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic…

Machine Learning · Computer Science 2026-05-11 Jakub Stępień , Marcin Mazur , Jacek Tabor , Przemysław Spurek

Interpreting Attention Layer Outputs with Sparse Autoencoders

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse,…

Machine Learning · Computer Science 2024-06-26 Connor Kissane , Robert Krzyzanowski , Joseph Isaac Bloom , Arthur Conmy , Neel Nanda

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM…

Machine Learning · Computer Science 2024-08-02 Senthooran Rajamanoharan , Tom Lieberum , Nicolas Sonnerat , Arthur Conmy , Vikrant Varma , János Kramár , Neel Nanda

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large…

Machine Learning · Computer Science 2026-05-19 Michał Brzozowski , Neo Christopher Chung

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a…

Machine Learning · Computer Science 2024-11-11 Kola Ayonrinde

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao , Mengnan Du

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo , Jared Wilson , Max Forsey , Bryce Hepner , Thomas Vin Howe , David Wingate

Improving Robustness In Sparse Autoencoders via Masked Regularization

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often…

Machine Learning · Computer Science 2026-04-09 Vivek Narayanaswamy , Kowshik Thopalli , Bhavya Kailkhura , Wesam Sakla

Efficient Dictionary Learning with Switch Sparse Autoencoders

Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to…

Machine Learning · Computer Science 2025-06-04 Anish Mudide , Joshua Engels , Eric J. Michaud , Max Tegmark , Christian Schroeder de Witt

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo , Alex Mallen , Caden Juang , Nora Belrose

A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Binary Sparse Coding for Interpretability

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue we propose to use binary sparse…

Machine Learning · Computer Science 2025-10-01 Lucia Quirke , Stepan Shabalin , Nora Belrose

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan , Arthur Conmy , Lewis Smith , Tom Lieberum , Vikrant Varma , János Kramár , Rohin Shah , Neel Nanda

Interpretable and Testable Vision Features via Sparse Autoencoders

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens , Wei-Lun Chao , Tanya Berger-Wolf , Yu Su

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only…

Machine Learning · Computer Science 2025-08-14 Charles O'Neill , Mudith Jayasekara , Max Kirkby

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo , Nora Belrose

Evaluating Sparse Autoencoders for Monosemantic Representation

A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into…

Machine Learning · Computer Science 2025-10-20 Moghis Fereidouni , Muhammad Umair Haider , Peizhong Ju , A. B. Siddique

Analysis of Variational Sparse Autoencoders

Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational…

Machine Learning · Computer Science 2025-10-03 Zachary Baker , Yuxiao Li