Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

Machine Learning 2025-10-06 v2 Artificial Intelligence

Abstract

Steering has emerged as a promising approach to controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. Steering vectors extracted from small datasets often contain task-irrelevant, noisy features, which degrade their effectiveness. To refine steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV), which leverages sparse autoencoders (SAEs) to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all baseline methods, including supervised fine-tuning. Our findings show that effective steering vectors can be constructed from limited training data by refining the original steering vector through SAEs.
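
The two-stage refinement described in the abstract (denoise, then augment) can be illustrated with a minimal sketch. The abstract does not give the scoring rules, so everything concrete here is an assumption: the function name `refine_steering_vector`, the cosine-similarity relevance test against a task-description embedding, the thresholds `keep_thr` and `add_thr`, and the choice to give each added feature the mean activation of the kept features. It should not be read as the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def refine_steering_vector(acts, decoder, feat_emb, task_emb,
                           keep_thr=0.5, add_thr=0.8):
    """Hypothetical denoise-then-augment refinement.

    acts     : SAE feature activations of the raw steering vector
    decoder  : one SAE decoder direction per feature
    feat_emb : embedding of each feature's text description (assumed available)
    task_emb : embedding of the task description (assumed available)
    """
    n = len(acts)

    # Stage 1 (denoise): keep active features whose description matches the task.
    kept = [i for i in range(n)
            if acts[i] > 0 and cosine(feat_emb[i], task_emb) >= keep_thr]
    weights = {i: acts[i] for i in kept}

    # Stage 2 (augment): add inactive features that are semantically close
    # to some kept feature. The fill weight is an assumption of this sketch.
    if kept:
        fill = sum(acts[i] for i in kept) / len(kept)
        for j in range(n):
            if acts[j] > 0:
                continue  # already judged in stage 1
            if max(cosine(feat_emb[j], feat_emb[i]) for i in kept) >= add_thr:
                weights[j] = fill

    # Rebuild the vector as a weighted sum of the selected decoder directions.
    d = len(decoder[0])
    refined = [0.0] * d
    for i, w in weights.items():
        for k in range(d):
            refined[k] += w * decoder[i][k]
    return refined, sorted(weights)

# Toy example: 4 SAE features, 3-dimensional residual stream.
acts = [2.0, 1.0, 0.0, 0.0]
decoder = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
feat_emb = [[1, 0], [0, 1], [1, 0.1], [0, 1]]
task_emb = [1, 0]
refined, selected = refine_steering_vector(acts, decoder, feat_emb, task_emb)
```

In this toy input, feature 1 is active but unrelated to the task and is dropped, while feature 2 is inactive but close to the kept feature 0 and is added, so the refined vector combines the decoder directions of features 0 and 2.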

Cite

@article{arxiv.2509.23799,
  title  = {Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement},
  author = {Anyi Wang and Xuansheng Wu and Dong Shu and Yunpu Ma and Ninghao Liu},
  journal= {arXiv preprint arXiv:2509.23799},
  year   = {2025}
}

Comments

19 pages, 11 figures, 7 tables
