English
Related papers

Related papers: Weight-sparse transformers have interpretable circ…

200 papers

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be…

Machine Learning · Computer Science 2026-05-12 Ward Gauderis , Thomas Dooms , Steven T. Holmer , Kola Ayonrinde , Geraint A. Wiggins

A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach…

Machine Learning · Computer Science 2025-05-02 Jeremias Ferrao , Luhan Mikaelson , Keenan Pepper , Natalia Perez-Campanero Antolin

To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive…

Machine Learning · Computer Science 2022-02-24 Jayneel Parekh , Pavlo Mozharovskyi , Florence d'Alché-Buc

Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization,…

Machine Learning · Computer Science 2026-01-21 George Wang , Daniel Murfet

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by…

Machine Learning · Computer Science 2024-12-25 Jason Gross , Rajashree Agrawal , Thomas Kwa , Euan Ong , Chun Hei Yip , Alex Gibson , Soufiane Noubir , Lawrence Chan

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in…

Machine Learning · Computer Science 2025-06-24 Philipp Mondorf , Sondre Wold , Barbara Plank

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing…

Machine Learning · Computer Science 2025-11-27 Matīss Kalnāre , Sofoklis Kitharidis , Thomas Bäck , Niki van Stein

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we…

Computation and Language · Computer Science 2020-07-09 Prajjwal Bhargava

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific…

Machine Learning · Computer Science 2024-07-12 Joseph Miller , Bilal Chughtai , William Saunders

Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks…

Computation and Language · Computer Science 2026-02-13 Usman Naseem

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called…

Computation and Language · Computer Science 2024-10-08 Michael Lan , Philip Torr , Fazl Barez

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy.…

Machine Learning · Computer Science 2020-07-02 Abhishek Ghose , Balaraman Ravindran

Mechanistic interpretability aims to reverse engineer neural networks by uncovering which high-level algorithms they implement. Causal abstraction provides a precise notion of when a network implements an algorithm, i.e., a causal model of…

Machine Learning · Computer Science 2025-03-17 Theodora-Mara Pîslar , Sara Magliacane , Atticus Geiger

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He , Xuyang Ge , Qiong Tang , Tianxiang Sun , Qinyuan Cheng , Xipeng Qiu

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of…

Computation and Language · Computer Science 2026-05-12 Michael Li , Nishant Subramani

Several recent works have empirically observed that Convolutional Neural Nets (CNNs) are (approximately) invertible. To understand this approximate invertibility phenomenon and how to leverage it more effectively, we focus on a theoretical…

Machine Learning · Statistics 2017-05-25 Anna C. Gilbert , Yi Zhang , Kibok Lee , Yuting Zhang , Honglak Lee

Word embeddings have found their way into a wide range of natural language processing tasks including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, hidden…

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to…

Machine Learning · Computer Science 2026-05-15 ML Nissen Gonzalez , Melwina Albuquerque , Laurence Wroe , Jacob Meyer Cohen , Logan Riggs Smith , Thomas Dooms

Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles…

Computation and Language · Computer Science 2026-04-21 Nils Feldhus , Laura Kopf

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy , Augustine N. Mavor-Parker , Aengus Lynch , Stefan Heimersheim , Adrià Garriga-Alonso