Related papers: Self-Regularization with Sparse Autoencoders for C…

AlignSAE: Concept-Aligned Sparse Autoencoders

Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable…

Machine Learning · Computer Science 2026-01-14 Minglai Yang, Xinyu Guo, Zhengliang Shi, Jinhe Bi, Steven Bethard, Mihai Surdeanu, Liangming Pan

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Sparse Autoencoder Features for Classifications and Transferability

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that…

Computation and Language · Computer Science 2026-02-23 Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier

Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the…

Machine Learning · Computer Science 2025-03-17 Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Route Sparse Autoencoder to Interpret Large Language Models

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable…

Machine Learning · Computer Science 2025-05-26 Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g.…

Artificial Intelligence · Computer Science 2025-12-12 Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or…

Computation and Language · Computer Science 2025-03-17 Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov

Measuring and Guiding Monosemanticity

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in…

Computation and Language · Computer Science 2025-12-02 Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting

Improving Robustness In Sparse Autoencoders via Masked Regularization

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often…

Machine Learning · Computer Science 2026-04-09 Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper…

Computation and Language · Computer Science 2025-12-08 Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and…

Machine Learning · Computer Science 2025-06-18 Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We…

Computation and Language · Computer Science 2025-08-07 Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets

Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise…

Computation and Language · Computer Science 2025-05-28 Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

Steering has emerged as a promising approach in controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which…

Machine Learning · Computer Science 2025-10-06 Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu