Related papers: Sparse Autoencoder Features for Classifications an…

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each…

Machine Learning · Computer Science 2024-10-29 Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

Probing the Representational Power of Sparse Autoencoders in Vision Models

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from…

Computer Vision and Pattern Recognition · Computer Science 2025-09-19 Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g.…

Artificial Intelligence · Computer Science 2025-12-12 Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that…

Computation and Language · Computer Science 2026-02-23 Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier

Sparse Autoencoders Reveal Interpretable Structure in Small Gene Language Models

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the internal representations of large language models (LLMs), revealing latent features with semantical meaning. This interpretability has also…

Other Quantitative Biology · Quantitative Biology 2025-07-11 Haoxiang Guan, Jiyan He, Jie Zhang

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across…

Machine Learning · Computer Science 2025-09-08 Lovis Heindrich, Philip Torr, Fazl Barez, Veronika Thost

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Route Sparse Autoencoder to Interpret Large Language Models

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable…

Machine Learning · Computer Science 2025-05-26 Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and…

Machine Learning · Computer Science 2025-06-18 Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang

Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the…

Machine Learning · Computer Science 2025-03-17 Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic…

Computation and Language · Computer Science 2026-02-24 Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao, Mengnan Du

Learning Retrieval Models with Sparse Autoencoders

Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned…

Machine Learning · Computer Science 2026-03-17 Thibault Formal, Maxime Louis, Hervé Dejean, Stéphane Clinchant