Related papers: Binary Sparse Coding for Interpretability

Transcoders Beat Sparse Autoencoders for Interpretability

Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents.…

Machine Learning · Computer Science 2025-02-13 Gonçalo Paulo, Stepan Shabalin, Nora Belrose

Sparse Autoencoder Features for Classifications and Transferability

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Machine Learning · Computer Science 2025-06-06 Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koryagin, Daniil Gavrilov

Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by some training-time regularization on…

Machine Learning · Computer Science 2026-02-13 Hakaze Cho, Haolin Yang, Yanshu Li, Brian M. Kurkoski, Naoya Inoue

Ensembling Sparse Autoencoders

Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs…

Machine Learning · Computer Science 2025-05-23 Soham Gadgil, Chris Lin, Su-In Lee

Tokenized SAEs: Disentangling SAE Reconstructions

Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work…

Machine Learning · Computer Science 2025-02-25 Thomas Dooms, Daniel Wilhelm

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising…

Computation and Language · Computer Science 2026-02-27 Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform…

Machine Learning · Computer Science 2025-01-31 Charles O'Neill, Alim Gumran, David Klindt

Interpretable and Testable Vision Features via Sparse Autoencoders

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and…

Machine Learning · Computer Science 2026-02-17 Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

Interpreting Attention Layer Outputs with Sparse Autoencoders

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse,…

Machine Learning · Computer Science 2024-06-26 Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and…

Machine Learning · Computer Science 2025-06-18 Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang

Measuring Sparse Autoencoder Feature Sensitivity

Sparse Autoencoder (SAE) features have become essential tools for mechanistic interpretability research. SAE features are typically characterized by examining their activating examples, which are often "monosemantic" and align with human…

Artificial Intelligence · Computer Science 2025-09-30 Claire Tian, Katherine Tian, Nathan Hu

Efficient Dictionary Learning with Switch Sparse Autoencoders

Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to…

Machine Learning · Computer Science 2025-06-04 Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and…

Sound · Computer Science 2026-03-20 Shih-Heng Wang, Tiantian Feng, Aditya Kommineni, Thanathai Lertpetchpun, Bowen Yi, Xuan Shi, Shrikanth Narayanan

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward…

Machine Learning · Computer Science 2025-12-03 Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Evaluating Sparse Autoencoders for Monosemantic Representation

A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into…

Machine Learning · Computer Science 2025-10-20 Moghis Fereidouni, Muhammad Umair Haider, Peizhong Ju, A. B. Siddique

Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as…

Machine Learning · Computer Science 2025-11-13 Ege Erdogan, Ana Lucic