Related papers: BatchTopK Sparse Autoencoders

200 papers

Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry…

Machine Learning · Computer Science 2025-09-01 Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah

Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal…

Computation and Language · Computer Science 2025-06-27 Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling

Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there…

Machine Learning · Computer Science 2025-10-03 Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic…

Machine Learning · Computer Science 2026-05-11 Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for…

Machine Learning · Computer Science 2026-04-17 Dongsheng Wang, Jinsen Zhang, Dawei Su, Hui Huang

Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss between the original model logits and the reconstructed model logits when replacing model…

Machine Learning · Computer Science 2025-04-01 Adam Karvonen

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM…

Machine Learning · Computer Science 2024-08-02 Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the…

Computation and Language · Computer Science 2025-06-10 Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse,…

Machine Learning · Computer Science 2024-06-26 Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the…

Machine Learning · Computer Science 2025-08-11 Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive…

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure…

Machine Learning · Computer Science 2025-12-08 Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao, Mengnan Du

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a…

Machine Learning · Computer Science 2024-11-11 Kola Ayonrinde

Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes…

Machine Learning · Computer Science 2025-05-28 Matthew Chen, Joshua Engels, Max Tegmark