Related papers: Weight-sparse transformers have interpretable circ…

From Mechanistic to Compositional Interpretability

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be…

Machine Learning · Computer Science 2026-05-12 Ward Gauderis, Thomas Dooms, Steven T. Holmer, Kola Ayonrinde, Geraint A. Wiggins

Self-Ablating Transformers: More Interpretability, Less Sparsity

A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach…

Machine Learning · Computer Science 2025-05-02 Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin

A Framework to Learn with Interpretation

To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive…

Machine Learning · Computer Science 2022-02-24 Jayneel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc

Patterning: The Dual of Interpretability

Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization,…

Machine Learning · Computer Science 2026-01-21 George Wang, Daniel Murfet

Compact Proofs of Model Performance via Mechanistic Interpretability

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by…

Machine Learning · Computer Science 2024-12-25 Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in…

Machine Learning · Computer Science 2025-06-24 Philipp Mondorf, Sondre Wold, Barbara Plank

Mechanistic Interpretability for Transformer-based Time Series Classification

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing…

Machine Learning · Computer Science 2025-11-27 Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

Adaptive Transformers for Learning Multimodal Representations

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we…

Computation and Language · Computer Science 2020-07-09 Prajjwal Bhargava

Transformer Circuit Faithfulness Metrics are not Robust

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific…

Machine Learning · Computer Science 2024-07-12 Joseph Miller, Bilal Chughtai, William Saunders

Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks…

Computation and Language · Computer Science 2026-02-13 Usman Naseem

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called…

Computation and Language · Computer Science 2024-10-08 Michael Lan, Philip Torr, Fazl Barez

Interpretability with Accurate Small Models

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy.…

Machine Learning · Computer Science 2020-07-02 Abhishek Ghose, Balaraman Ravindran

Combining Causal Models for More Accurate Abstractions of Neural Networks

Mechanistic interpretability aims to reverse engineer neural networks by uncovering which high-level algorithms they implement. Causal abstraction provides a precise notion of when a network implements an algorithm, i.e., a causal model of…

Machine Learning · Computer Science 2025-03-17 Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of…

Computation and Language · Computer Science 2026-05-12 Michael Li, Nishant Subramani

Towards Understanding the Invertibility of Convolutional Neural Networks

Several recent works have empirically observed that Convolutional Neural Nets (CNNs) are (approximately) invertible. To understand this approximate invertibility phenomenon and how to leverage it more effectively, we focus on a theoretical…

Machine Learning · Statistics 2017-05-25 Anna C. Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, Honglak Lee

Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain

Word embeddings have found their way into a wide range of natural language processing tasks including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, hidden…

Computation and Language · Computer Science 2020-05-12 Mohammad Amin Samadi, Mohammad Sadegh Akhondzadeh, Sayed Jalal Zahabi, Mohammad Hossein Manshaei, Zeinab Maleki, Payman Adibi

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to…

Machine Learning · Computer Science 2026-05-15 ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

Interpreting Language Models Through Concept Descriptions: A Survey

Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles…

Computation and Language · Computer Science 2026-04-21 Nils Feldhus, Laura Kopf

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso