Related papers: Learning Retrieval Models with Sparse Autoencoders

200 papers

Sparse Autoencoders (SAEs) offer the potential to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze…

Machine Learning · Computer Science 2026-02-03 Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from…

Computer Vision and Pattern Recognition · Computer Science 2025-09-19 Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and…

Machine Learning · Computer Science 2025-07-30 Viktoria Schuster

Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously…

Information Retrieval · Computer Science 2025-08-28 Seongwan Park, Taeklim Kim, Youngjoong Ko

Vision foundation models (FMs) achieve state-of-the-art performance in medical imaging. However, they encode information in abstract latent representations that clinicians cannot interrogate or verify. The goal of this study is to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Philipp Wesp, Robbie Holland, Vasiliki Sideri-Lampretsa, Sergios Gatidis

Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each…

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable…

Machine Learning · Computer Science 2025-10-10 Yifei Yao, Mengnan Du

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable…

Machine Learning · Computer Science 2025-05-26 Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Guojun Ma, Xiang Wang, Xiangnan He

Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic…

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if placed in the…

Machine Learning · Computer Science 2025-03-17 Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise…

Computation and Language · Computer Science 2025-05-28 Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows…

Machine Learning · Computer Science 2025-01-31 Gonçalo Paulo, Nora Belrose

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the internal representations of large language models (LLMs), revealing latent features with semantic meaning. This interpretability has also…

Other Quantitative Biology · Quantitative Biology 2025-07-11 Haoxiang Guan, Jiyan He, Jie Zhang

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there…

Machine Learning · Computer Science 2025-10-03 Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and…

Machine Learning · Computer Science 2025-06-18 Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the…

Computation and Language · Computer Science 2025-06-10 Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang