Related papers: Transcoders Beat Sparse Autoencoders for Interpret…

Evaluating SAE interpretability without explanations

Sparse autoencoders (SAEs) and transcoders have become important tools for machine learning interpretability. However, measuring how interpretable they are remains challenging, with weak consensus about which benchmarks to use. Most…

Machine Learning · Computer Science 2025-07-14 Gonçalo Paulo, Nora Belrose

Interpreting Attention Layer Outputs with Sparse Autoencoders

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse,…

Machine Learning · Computer Science 2024-06-26 Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Beyond Black Boxes: Enhancing Interpretability of Transformers Trained on Neural Data

Transformer models have become state-of-the-art in decoding stimuli and behavior from neural activity, significantly advancing neuroscience research. Yet greater transparency in their decision-making processes would substantially enhance…

Quantitative Methods · Quantitative Biology 2025-06-18 Laurence Freeman, Philip Shamash, Vinam Arora, Caswell Barry, Tiago Branco, Eva Dyer

Binary Sparse Coding for Interpretability

Sparse autoencoders (SAEs) are used to decompose neural network activations into sparsely activating features, but many SAE features are only interpretable at high activation strengths. To address this issue we propose to use binary sparse…

Machine Learning · Computer Science 2025-10-01 Lucia Quirke, Stepan Shabalin, Nora Belrose

Efficient Dictionary Learning with Switch Sparse Autoencoders

Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to…

Machine Learning · Computer Science 2025-06-04 Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising…

Computation and Language · Computer Science 2026-02-27 Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

Interpretable and Testable Vision Features via Sparse Autoencoders

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

Can sparse autoencoders make sense of gene expression latent variable models?

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Machine Learning · Computer Science 2025-06-06 Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koryagin, Daniil Gavrilov

Sparse Autoencoders, Again?

Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could…

Machine Learning · Computer Science 2025-06-09 Yin Lu, Xuening Zhu, Tong He, David Wipf

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform…

Machine Learning · Computer Science 2025-01-31 Charles O'Neill, Alim Gumran, David Klindt

Sparse Autoencoders for Interpretable Medical Image Representation Learning

Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their…

Sound · Computer Science 2026-02-09 Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control

Many current state-of-the-art models for sequential recommendations are based on transformer architectures. Interpretation and explanation of such black box models is an important research question, as a better understanding of their…

Information Retrieval · Computer Science 2026-02-18 Anton Klenitskiy, Konstantin Polev, Daria Denisova, Alexey Vasilev, Dmitry Simakov, Gleb Gusev

Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as…

Machine Learning · Computer Science 2025-11-13 Ege Erdogan, Ana Lucic

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward…

Machine Learning · Computer Science 2025-12-03 Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language…

Machine Learning · Computer Science 2024-11-08 Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate