Related papers: Diversity-driven Data Selection for Language Model…

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g.…

Artificial Intelligence · Computer Science 2025-12-12 Nick Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda

Sparse Autoencoders Trained on the Same Data Learn Different Features

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Sparse Autoencoder Features for Classifications and Transferability

Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and…

Machine Learning · Computer Science 2025-06-18 Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic…

Machine Learning · Computer Science 2026-05-11 Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely…

Machine Learning · Computer Science 2025-05-20 Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero

Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise…

Computation and Language · Computer Science 2025-05-28 Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

Empirical Evaluation of Progressive Coding for Sparse Autoencoders

Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with…

Machine Learning · Computer Science 2025-05-02 Hans Peter, Anders Søgaard

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model…

Machine Learning · Computer Science 2026-05-27 Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

Can sparse autoencoders make sense of gene expression latent variable models?

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Where to Pay Attention in Sparse Training for Feature Selection?

A new line of research for feature selection based on neural networks has recently emerged. Despite its superiority to classical methods, it requires many training iterations to converge and detect informative features. The computational…

Machine Learning · Computer Science 2022-11-29 Ghada Sokar, Zahra Atashgahi, Mykola Pechenizkiy, Decebal Constantin Mocanu

Improving Dictionary Learning with Gated Sparse Autoencoders

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

SAEs Are Good for Steering -- If You Select the Right Features

Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept -…

Machine Learning · Computer Science 2025-12-23 Dana Arad, Aaron Mueller, Yonatan Belinkov

Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the…

Machine Learning · Computer Science 2025-03-17 Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty

Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning

Enhancing the instruction-following ability of Large Language Models (LLMs) primarily demands substantial instruction-tuning datasets. However, the sheer volume of these imposes a considerable computational burden and annotation cost. To…

Computation and Language · Computer Science 2023-11-15 Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, Chang Zhou