Related papers: Sparse Autoencoders for Hypothesis Generation

Can sparse autoencoders make sense of gene expression latent variable models?

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g.…

Artificial Intelligence · Computer Science 2025-12-12 Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

Automatically Interpreting Millions of Features in Large Language Models

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Evaluating SAE interpretability without explanations

Sparse autoencoders (SAEs) and transcoders have become important tools for machine learning interpretability. However, measuring how interpretable they are remains challenging, with weak consensus about which benchmarks to use. Most…

Machine Learning · Computer Science 2025-07-14 Gonçalo Paulo, Nora Belrose

Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic…

Computation and Language · Computer Science 2026-02-24 Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We…

Computation and Language · Computer Science 2025-08-07 Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets

Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that…

Computation and Language · Computer Science 2026-02-23 Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier

Learning and Evaluating Sparse Interpretable Sentence Embeddings

Previous research on word embeddings has shown that sparse representations, which can be either learned on top of existing dense embeddings or obtained through model constraints during training time, have the benefit of increased…

Computation and Language · Computer Science 2018-09-26 Valentin Trifonov, Octavian-Eugen Ganea, Anna Potapenko, Thomas Hofmann

AlignSAE: Concept-Aligned Sparse Autoencoders

Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable…

Machine Learning · Computer Science 2026-01-14 Minglai Yang, Xinyu Guo, Zhengliang Shi, Jinhe Bi, Steven Bethard, Mihai Surdeanu, Liangming Pan

Hybrid Embedded Deep Stacked Sparse Autoencoder with w_LPPD SVM Ensemble

Deep learning is a kind of feature learning method with strong nonlinear feature transformation and becomes more and more important in many fields of artificial intelligence. Deep autoencoder is one representative method of the deep learning…

Machine Learning · Computer Science 2020-02-18 Yongming Li, Yan Lei, Pin Wang, Yuchuan Liu

Interpretable and Testable Vision Features via Sparse Autoencoders

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. While earlier work offers either rich semantics or direct control, few post-hoc…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Do Sparse Autoencoders Identify Reasoning Features in Language Models?

We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain…

Machine Learning · Computer Science 2026-05-19 George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi

From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders

Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. However, broad evidence suggests that LLMs capture the…

Artificial Intelligence · Computer Science 2026-02-13 Yifan Luo, Yang Zhan, Jiedong Jiang, Tianyang Liu, Mingrui Wu, Zhennan Zhou, Bin Dong

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic…

Machine Learning · Computer Science 2026-05-11 Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and…

Computation and Language · Computer Science 2025-02-19 Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising…

Computation and Language · Computer Science 2026-02-27 Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon