Related papers: Efficient Dictionary Learning with Switch Sparse A…

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Transcoders Beat Sparse Autoencoders for Interpretability

Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents.…

Machine Learning · Computer Science 2025-02-13 Gonçalo Paulo, Stepan Shabalin, Nora Belrose

Can sparse autoencoders make sense of gene expression latent variable models?

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been…

Machine Learning · Computer Science 2024-05-27 Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a…

Machine Learning · Computer Science 2024-11-11 Kola Ayonrinde

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics,…

Machine Learning · Computer Science 2024-12-02 Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and…

Machine Learning · Computer Science 2026-02-17 Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

SMIXAE: Towards Unsupervised Manifold Discovery in Language Models

Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features.…

Machine Learning · Computer Science 2026-05-12 Collin Francel

Interpretable and Testable Vision Features via Sparse Autoencoders

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

Features Emerge as Discrete States: The First Application of SAEs to 3D Representations

Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations, translating the hidden state into human ideas with high semantic value despite no external intervention or guidance.…

Machine Learning · Computer Science 2025-12-17 Albert Miao, Chenliang Zhou, Jiawei Zhou, Cengiz Oztireli

Sparse Autoencoders, Again?

Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could…

Machine Learning · Computer Science 2025-06-09 Yin Lu, Xuening Zhu, Tong He, David Wipf

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising…

Computation and Language · Computer Science 2026-02-27 Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward…

Machine Learning · Computer Science 2025-12-03 Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures

Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits…

Computation and Language · Computer Science 2025-06-03 Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

Sparse Autoencoder Features for Classifications and Transferability

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as…

Machine Learning · Computer Science 2025-11-13 Ege Erdogan, Ana Lucic

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Machine Learning · Computer Science 2025-06-06 Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koryagin, Daniil Gavrilov