Related papers: AlignSAE: Concept-Aligned Sparse Autoencoders

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi , Andrea Dittadi , Sébastien Lachapelle , Dhanya Sridhar

Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders

Recent work shows that Sparse Autoencoders (SAE) applied to large language model (LLM) layers have neurons corresponding to interpretable concepts. These SAE neurons can be modified to align generated outputs, but only towards…

Computation and Language · Computer Science 2025-07-01 Ananya Joshi , Celia Cintas , Skyler Speakman

Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that…

Computation and Language · Computer Science 2026-02-23 Mathis Le Bail , Jérémie Dentan , Davide Buscaldi , Sonia Vanier

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a…

Machine Learning · Computer Science 2025-09-24 Dong Shu , Xuansheng Wu , Haiyan Zhao , Daking Rai , Ziyu Yao , Ninghao Liu , Mengnan Du

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu , Jiayi Yuan , Wenlin Yao , Xiaoming Zhai , Ninghao Liu

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from…

Quantitative Methods · Quantitative Biology 2026-01-21 Xiangyu Liu , Haodi Lei , Yi Liu , Yang Liu , Wei Hu

Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the…

Machine Learning · Computer Science 2025-03-17 Matthew Khoriaty , Andrii Shportko , Gustavo Mercier , Zach Wood-Doughty

Sparse Autoencoder Features for Classifications and Transferability

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant , Shan Chen , Kuleen Sasse , Hugo Aerts , Thomas Hartvigsen , Danielle S. Bitterman

Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification

Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model…

Computation and Language · Computer Science 2025-07-29 Xuansheng Wu , Wenhao Yu , Xiaoming Zhai , Ninghao Liu

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo , Nora Belrose

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang , Jiayu Liu , Yuhang Lai , Hao Xu , Zhenya Huang , Ning Miao

Probing the Representational Power of Sparse Autoencoders in Vision Models

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from…

Computer Vision and Pattern Recognition · Computer Science 2025-09-19 Matthew Lyle Olson , Musashi Hinck , Neale Ratzlaff , Changbai Li , Phillip Howard , Vasudev Lal , Shao-Yen Tseng

Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise…

Computation and Language · Computer Science 2025-05-28 Boyi Deng , Yu Wan , Yidan Zhang , Baosong Yang , Fuli Feng

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan , Arthur Conmy , Lewis Smith , Tom Lieberum , Vikrant Varma , János Kramár , Rohin Shah , Neel Nanda

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao , Mengnan Du

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo , Alex Mallen , Caden Juang , Nora Belrose

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by…

Computation and Language · Computer Science 2026-05-25 Dongxin Guo , Jikun Wu , Siu Ming Yiu

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders…

Machine Learning · Computer Science 2025-02-18 Zirui He , Haiyan Zhao , Yiran Qiao , Fan Yang , Ali Payani , Jing Ma , Mengnan Du

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Mateusz Pach , Shyamgopal Karthik , Quentin Bouniot , Serge Belongie , Zeynep Akata

Route Sparse Autoencoder to Interpret Large Language Models

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable…

Machine Learning · Computer Science 2025-05-26 Wei Shi , Sihang Li , Tao Liang , Mingyang Wan , Guojun Ma , Xiang Wang , Xiangnan He