Related papers: Learning Transformer Programs

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of…

Machine Learning · Computer Science 2026-02-10 Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

Weight-sparse transformers have interpretable circuits

Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each…

Machine Learning · Computer Science 2025-11-18 Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing

Thinking Like Transformers

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no…

Machine Learning · Computer Science 2021-07-20 Gail Weiss, Yoav Goldberg, Eran Yahav

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a…

Machine Learning · Computer Science 2023-12-05 Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

Tracr: Compiled Transformers as a Laboratory for Interpretability

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study…

Machine Learning · Computer Science 2023-11-06 David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Thomas McGrath, Vladimir Mikulik

Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformer is…

Machine Learning · Computer Science 2026-03-20 Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and…

Computation and Language · Computer Science 2025-09-08 Faruk Alpay, Taylan Alpay

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for…

Artificial Intelligence · Computer Science 2025-10-14 Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao

Generating Interpretable Networks using Hypernetworks

An essential goal in mechanistic interpretability is to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the…

Machine Learning · Computer Science 2023-12-07 Isaac Liao, Ziming Liu, Max Tegmark

Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language…

Machine Learning · Computer Science 2024-11-08 Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Neural Decompiling of Tracr Transformers

Recently, the transformer architecture has enabled substantial progress in many areas of pattern recognition and machine learning. However, as with other neural network models, there is currently no general method available to explain their…

Machine Learning · Computer Science 2024-12-02 Hannes Thurnherr, Kaspar Riesen

Linguistic Interpretability of Transformer-based Language Models: a systematic review

Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little…

Computation and Language · Computer Science 2025-04-14 Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero

Interpreting Transformers Through Attention Head Intervention

Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in…

Computation and Language · Computer Science 2026-03-02 Mason Kadem, Rong Zheng

Large Language Models are Interpretable Learners

The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack…

Artificial Intelligence · Computer Science 2024-06-26 Ruochen Wang, Si Si, Felix Yu, Dorothea Wiesmann, Cho-Jui Hsieh, Inderjit Dhillon

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron…

Machine Learning · Computer Science 2025-11-26 Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

Techniques for Interpretable Machine Learning

Interpretable machine learning tackles the important problem that humans cannot understand the behaviors of complex machine learning models and how these models arrive at a particular decision. Although many approaches have been proposed, a…

Machine Learning · Computer Science 2019-05-21 Mengnan Du, Ninghao Liu, Xia Hu

Can Transformers Learn to Solve Problems Recursively?

Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular…

Machine Learning · Computer Science 2023-06-27 Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer

Validating Mechanistic Interpretations: An Axiomatic Approach

Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the…

Machine Learning · Computer Science 2025-06-24 Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha

What Algorithms can Transformers Learn? A Study in Length Generalization

Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true…

Machine Learning · Computer Science 2023-10-25 Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran

Mechanistic interpretability of large language models with applications to the financial services industry

Large Language Models such as GPTs (Generative Pre-trained Transformers) exhibit remarkable capabilities across a broad spectrum of applications. Nevertheless, due to their intrinsic complexity, these models present substantial challenges…

Machine Learning · Computer Science 2024-10-17 Ashkan Golgoon, Khashayar Filom, Arjun Ravi Kannan