Related papers: Patch-Effect Graph Kernels for LLM Interpretabilit…

Automated Circuit Interpretation via Probe Prompting

Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis --…

Computation and Language · Computer Science 2025-11-11 Giuseppe Birardi

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace…

Machine Learning · Computer Science 2023-12-07 Aleksandar Makelov, Georg Lange, Neel Nanda

Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level…

Computation and Language · Computer Science 2026-03-12 Ajay Pravin Mahale

Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs

While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic…

Machine Learning · Computer Science 2025-07-09 Sofiia Chorna, Kateryna Tarelkina, Eloïse Berthier, Gianni Franchi

Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability

Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful directions in Transformer activation space. This paper develops a field-theoretic…

Machine Learning · Computer Science 2026-05-26 David N. Olivieri, Antonio F. Pérez Rodríguez

Mechanistic Interpretability for Transformer-based Time Series Classification

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing…

Machine Learning · Computer Science 2025-11-27 Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

Dissecting Jet-Tagger Through Mechanistic Interpretability

Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained…

High Energy Physics - Phenomenology · Physics 2026-05-12 Saurabh Rai, Sanmay Ganguly

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or…

Machine Learning · Computer Science 2024-01-18 Fred Zhang, Neel Nanda

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation…

Artificial Intelligence · Computer Science 2026-04-16 Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian

Mechanistic Interpretability in the Presence of Architectural Obfuscation

Architectural obfuscation - e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens - has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving…

Cryptography and Security · Computer Science 2025-06-24 Marcos Florencio, Thomas Barton

PatchGT: Transformer over Non-trainable Clusters for Learning Graph Representations

Recently the Transformer structure has shown good performances in graph learning tasks. However, these Transformer models directly work on graph nodes and may have difficulties learning high-level information. Inspired by the vision…

Machine Learning · Computer Science 2023-04-11 Han Gao, Xu Han, Jiaoyang Huang, Jian-Xun Wang, Li-Ping Liu

Causal Analysis for Robust Interpretability of Neural Networks

Interpreting the inner function of neural networks is crucial for the trustworthy development and deployment of these black-box models. Prior interpretability methods focus on correlation-based measures to attribute model decisions to…

Machine Learning · Computer Science 2023-06-21 Ola Ahmad, Nicolas Bereux, Loïc Baret, Vahid Hashemi, Freddy Lecue

Mechanistic interpretability of large language models with applications to the financial services industry

Large Language Models such as GPTs (Generative Pre-trained Transformers) exhibit remarkable capabilities across a broad spectrum of applications. Nevertheless, due to their intrinsic complexity, these models present substantial challenges…

Machine Learning · Computer Science 2024-10-17 Ashkan Golgoon, Khashayar Filom, Arjun Ravi Kannan

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

GIN-Graph: A Generative Interpretation Network for Model-Level Explanation of Graph Neural Networks

One significant challenge of exploiting Graph neural networks (GNNs) in real-life scenarios is that they are always treated as black boxes, therefore leading to the requirement of interpretability. To address this, model-level…

Machine Learning · Computer Science 2025-09-22 Xiao Yue, Guangzhi Qu, Lige Gan

CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse,…

Machine Learning · Computer Science 2026-03-24 Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf

Seeing Through Circuits: Faithful Mechanistic Interpretability for Vision Transformers

Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying…

Artificial Intelligence · Computer Science 2026-04-17 Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer

Attribution Patching Outperforms Automated Circuit Discovery

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation…

Machine Learning · Computer Science 2023-11-21 Aaquib Syed, Can Rager, Arthur Conmy

Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits…

Machine Learning · Computer Science 2024-12-06 Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen