Related papers: Representation Tuning

200 papers

The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models…

Computation and Language · Computer Science 2025-04-15 Alessandro Stolfo , Vidhisha Balachandran , Safoora Yousefi , Eric Horvitz , Besmira Nushi

Recent advancements in large language models (LLMs) have resulted in increasingly anthropomorphic language concerning the ability of LLMs to reason. Whether reasoning in LLMs should be understood to be inherently different is, however,…

Machine Learning · Computer Science 2025-07-28 Bertram Højer , Oliver Jarvis , Stefan Heinrich

Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer's residual streams using appropriately constructed "steering vectors." These modifications to internal neural activations, a form of…

Computation and Language · Computer Science 2025-05-20 Jian-Qiao Zhu , Haijiang Yan , Thomas L. Griffiths

The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues on LLMs, which could be…

Artificial Intelligence · Computer Science 2025-06-06 Kai Wang , Yihao Zhang , Meng Sun

Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that…

Machine Learning · Computer Science 2026-04-22 Julian Skifstad , Xinyue Annie Yang , Glen Chou

The field of large language models (LLMs) has grown rapidly in recent years, driven by the desire for better efficiency, interpretability, and safe use. Building on the novel approach of "activation engineering," this study explores…

Computation and Language · Computer Science 2025-08-26 Rumi Allbert , James K. Wiles , Vlad Grankovsky

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text…

Computation and Language · Computer Science 2024-02-05 Kai Konen , Sophie Jentzsch , Diaoulé Diallo , Peer Schütt , Oliver Bensch , Roxanne El Baff , Dominik Opitz , Tobias Hecking

Large language models have transformed AI, yet reliably controlling their outputs remains a challenge. This paper explores activation engineering, where outputs of pre-trained LLMs are controlled by manipulating their activations at…

Neural and Evolutionary Computing · Computer Science 2025-05-13 Joris Postmus , Steven Abreu

Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipelines for these methods have primarily relied on subjective demonstrations, instead of quantitative,…

Artificial Intelligence · Computer Science 2024-10-23 Itamar Pres , Laura Ruis , Ekdeep Singh Lubana , David Krueger

As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data…

Artificial Intelligence · Computer Science 2025-08-13 Shivam Dubey

Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output…

Computation and Language · Computer Science 2025-11-25 Niranjan Chebrolu , Gerard Christopher Yeo , Kokil Jaidka

Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for…

Machine Learning · Computer Science 2025-10-14 Shashank Kirtania , Arun Iyer

This paper investigates how Large Language Models (LLMs) represent non-English tokens -- a question that remains underexplored despite recent progress. We propose a lightweight intervention method using representation steering, where a…

Computation and Language · Computer Science 2025-08-27 Omar Mahmoud , Buddhika Laknath Semage , Thommen George Karimpanal , Santu Rana

The ability to steer AI behavior is crucial to preventing its long term dangerous and catastrophic potential. Representation Engineering (RepE) has emerged as a novel, powerful method to steer internal model behaviors, such as "honesty", at…

Machine Learning · Computer Science 2024-10-10 Akshat Kannan

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works -- specifically, what internal mechanisms steering vectors affect…

Machine Learning · Computer Science 2026-04-10 Stephen Cheng , Sarah Wiegreffe , Dinesh Manocha

Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent…

Computation and Language · Computer Science 2025-01-23 Jingyuan Yang , Rongjun Li , Weixuan Wang , Ziyu Zhou , Zhiyong Feng , Wei Peng

Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by the recent…

In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input…

Artificial Intelligence · Computer Science 2026-04-28 Jack King , Evelina Fedorenko , Eghbal A. Hosseini

Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive…

Artificial Intelligence · Computer Science 2025-05-07 Yixiong Hao , Ayush Panda , Stepan Shabalin , Sheikh Abdur Raheem Ali

Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. Activation steering provides a lightweight alternative to prompt engineering…

Artificial Intelligence · Computer Science 2026-01-30 Diaoulé Diallo , Katharina Dworatzyk , Sophie Jentzsch , Peer Schütt , Sabine Theis , Tobias Hecking