Related papers: Enhancing LLM Steering through Sparse Autoencoder-…

200 papers

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper…

Computation and Language · Computer Science 2025-12-08 Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are…

Artificial Intelligence · Computer Science 2026-01-08 Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the…

Computation and Language · Computer Science 2025-07-31 Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering…

Machine Learning · Computer Science 2024-11-14 Harry Mayne, Yushi Yang, Adam Mahdi

To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier…

Machine Learning · Computer Science 2024-11-22 Sviatoslav Chalnev, Matthew Siu, Arthur Conmy

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, Zeynep Akata

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large…

Computation and Language · Computer Science 2026-05-05 Seonglae Cho, Zekun Wu, Adriano Koshiyama

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders…

Machine Learning · Computer Science 2025-02-18 Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned…

Machine Learning · Computer Science 2026-03-17 Thibault Formal, Maxime Louis, Hervé Dejean, Stéphane Clinchant

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned…

Robotics · Computer Science 2026-03-20 Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang

Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a…

Machine Learning · Computer Science 2025-02-25 Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda

Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We…

Computation and Language · Computer Science 2025-08-07 Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets

Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model's latent space. This enables useful applications such as steering - influencing the output of a model towards a desired concept -…

Machine Learning · Computer Science 2025-12-23 Dana Arad, Aaron Mueller, Yonatan Belinkov

Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure…

Machine Learning · Computer Science 2025-12-08 Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu