Related papers: BatchTopK Sparse Autoencoders

Distribution-Aware Feature Selection for SAEs

Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry…

Machine Learning · Computer Science 2025-09-01 Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah

TopK Language Models

Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal…

Computation and Language · Computer Science 2025-06-27 Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there…

Machine Learning · Computer Science 2025-10-03 Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Machine Learning · Computer Science 2025-06-06 Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koryagin, Daniil Gavrilov

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic…

Machine Learning · Computer Science 2026-05-11 Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Improving Sparse Autoencoder with Dynamic Attention

Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for…

Machine Learning · Computer Science 2026-04-17 Dongsheng Wang, Jinsen Zhang, Dawei Su, Hui Huang

Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need

Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss between the original model logits and the reconstructed model logits when replacing model…

Machine Learning · Computer Science 2025-04-01 Adam Karvonen

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM…

Machine Learning · Computer Science 2024-08-02 Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Training Superior Sparse Autoencoders for Instruct Models

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the…

Computation and Language · Computer Science 2025-06-10 Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang

Interpreting Attention Layer Outputs with Sparse Autoencoders

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse,…

Machine Learning · Computer Science 2024-06-26 Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the…

Machine Learning · Computer Science 2025-08-11 Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier

Mechanistic Interpretability of Antibody Language Models Using SAEs

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive…

Machine Learning · Computer Science 2026-05-27 Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure…

Machine Learning · Computer Science 2025-12-08 Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao, Mengnan Du

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a…

Machine Learning · Computer Science 2024-11-11 Kola Ayonrinde

Low-Rank Adapting Models for Sparse Autoencoders

Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes…

Machine Learning · Computer Science 2025-05-28 Matthew Chen, Joshua Engels, Max Tegmark