Related papers: Steered Generation via Gradient-Based Optimization…

200 papers

Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal…

Computation and Language · Computer Science 2025-02-27 Sumanta Bhattacharyya, Pedram Rooshenas

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper…

Computation and Language · Computer Science 2025-12-08 Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du

Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on…

Machine Learning · Computer Science 2025-09-24 Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu

A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior…

Machine Learning · Computer Science 2025-03-04 Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, Pascal Vincent

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic…

Computation and Language · Computer Science 2025-10-03 Jiaqing Xie

Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g. backtracking, cross-verification) during reasoning process, which improves their performance on complex tasks. Currently, reasoning strategies are…

Artificial Intelligence · Computer Science 2026-01-08 Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, Fuli Feng

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures,…

Computation and Language · Computer Science 2025-02-24 Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering…

Machine Learning · Computer Science 2025-04-02 Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as…

Machine Learning · Computer Science 2026-03-04 Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Large language models (LLMs) exhibit impressive capabilities in generation tasks but are prone to producing harmful, misleading, or biased content, posing significant ethical and safety concerns. To mitigate such risks, representation…

Cryptography and Security · Computer Science 2025-11-17 Zeqing He, Zhibo Wang, Huiyu Xu, Hejun Lin, Wenhui Zhang, Zhixuan Chu

Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent…

Computation and Language · Computer Science 2025-01-23 Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders…

Machine Learning · Computer Science 2025-02-18 Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are…

Computation and Language · Computer Science 2025-10-17 Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on…

Computation and Language · Computer Science 2026-05-25 Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

Steering, or direct manipulation of internal activations to guide LLM responses toward specific semantic concepts, is emerging as a promising avenue for both understanding how semantic concepts are stored within LLMs and advancing LLM…

Machine Learning · Computer Science 2026-02-03 Parmida Davarmanesh, Ashia Wilson, Adityanarayanan Radhakrishnan

Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On the one hand, these approaches…

Machine Learning · Computer Science 2026-03-03 Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar

Recently, sparse autoencoders (SAEs) have emerged as a promising technique for interpreting activations in foundation models by disentangling features into a sparse set of concepts. However, identifying the optimal level of sparsity for…

Machine Learning · Computer Science 2026-04-17 Dongsheng Wang, Jinsen Zhang, Dawei Su, Hui Huang

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large…

Computation and Language · Computer Science 2026-05-05 Seonglae Cho, Zekun Wu, Adriano Koshiyama

Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often…

Computation and Language · Computer Science 2025-06-04 Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, Ningyu Zhang

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We…

Machine Learning · Computer Science 2024-05-01 Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda