Related papers: Efficient Dictionary Learning with Switch Sparse A…

200 papers

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents.…

Machine Learning · Computer Science 2025-02-13 Gonçalo Paulo, Stepan Shabalin, Nora Belrose

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been…

Machine Learning · Computer Science 2024-05-27 Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a…

Machine Learning · Computer Science 2024-11-11 Kola Ayonrinde

Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. However, a major bottleneck for SAE development has been the lack of high-quality performance metrics,…

Machine Learning · Computer Science 2024-12-02 Adam Karvonen, Can Rager, Samuel Marks, Neel Nanda

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and…

Machine Learning · Computer Science 2026-02-17 Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features.…

Machine Learning · Computer Science 2026-05-12 Collin Francel

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

Sparse Autoencoders (SAEs) are a powerful dictionary learning technique for decomposing neural network activations, translating the hidden state into human ideas with high semantic value despite no external intervention or guidance.…

Machine Learning · Computer Science 2025-12-17 Albert Miao, Chenliang Zhou, Jiawei Zhou, Cengiz Oztireli

Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could…

Machine Learning · Computer Science 2025-06-09 Yin Lu, Xuening Zhu, Tong He, David Wipf

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising…

Computation and Language · Computer Science 2026-02-27 Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward…

Machine Learning · Computer Science 2025-12-03 Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits…

Computation and Language · Computer Science 2025-06-03 Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch

Sparse Autoencoders (SAEs) offer the potential to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as…

Machine Learning · Computer Science 2025-11-13 Ege Erdogan, Ana Lucic

Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are…
