Related papers: Sparse Autoencoders Trained on the Same Data Learn…

200 papers

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each…

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic…

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise…

Computation and Language · Computer Science 2025-05-28 Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

Sparse Autoencoders (SAEs) offer the potential to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the internal representations of large language models (LLMs), revealing latent features with semantic meaning. This interpretability has also…

Other Quantitative Biology · Quantitative Biology 2025-07-11 Haoxiang Guan, Jiyan He, Jie Zhang

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights into LLM…

Machine Learning · Computer Science 2025-11-11 Zhen Xu, Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen

Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs…

Machine Learning · Computer Science 2025-05-23 Soham Gadgil, Chris Lin, Su-In Lee

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept -…

Machine Learning · Computer Science 2025-12-23 Dana Arad, Aaron Mueller, Yonatan Belinkov

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from…

Computer Vision and Pattern Recognition · Computer Science 2025-09-19 Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which…

Machine Learning · Computer Science 2025-08-07 Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose

Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable…

Machine Learning · Computer Science 2026-01-14 Minglai Yang, Xinyu Guo, Zhengliang Shi, Jinhe Bi, Steven Bethard, Mihai Surdeanu, Liangming Pan

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only…

Machine Learning · Computer Science 2025-08-14 Charles O'Neill, Mudith Jayasekara, Max Kirkby

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable…

Machine Learning · Computer Science 2025-05-26 Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the…

Computation and Language · Computer Science 2025-06-10 Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar