Related papers: Patch-Effect Graph Kernels for LLM Interpretabilit…

200 papers

Mechanistic interpretability aims to understand neural networks by identifying which learned features mediate specific behaviors. Attribution graphs reveal these feature pathways, but interpreting them requires extensive manual analysis --…

Computation and Language · Computer Science 2025-11-11 Giuseppe Birardi

Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace…

Machine Learning · Computer Science 2023-12-07 Aleksandar Makelov, Georg Lange, Neel Nanda

Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level…

Computation and Language · Computer Science 2026-03-12 Ajay Pravin Mahale

While concept-based interpretability methods have traditionally focused on local explanations of neural network predictions, we propose a novel framework and interactive tool that extends these methods into the domain of mechanistic…

Machine Learning · Computer Science 2025-07-09 Sofiia Chorna, Kateryna Tarelkina, Eloïse Berthier, Gianni Franchi

Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful directions in Transformer activation space. This paper develops a field-theoretic…

Machine Learning · Computer Science 2026-05-26 David N. Olivieri, Antonio F. Pérez Rodríguez

Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing…

Machine Learning · Computer Science 2025-11-27 Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained…

High Energy Physics - Phenomenology · Physics 2026-05-12 Saurabh Rai, Sanmay Ganguly

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or…

Machine Learning · Computer Science 2024-01-18 Fred Zhang, Neel Nanda

Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation…

Artificial Intelligence · Computer Science 2026-04-16 Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian

Architectural obfuscation - e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens - has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving…

Cryptography and Security · Computer Science 2025-06-24 Marcos Florencio, Thomas Barton

Recently the Transformer structure has shown good performances in graph learning tasks. However, these Transformer models directly work on graph nodes and may have difficulties learning high-level information. Inspired by the vision…

Machine Learning · Computer Science 2023-04-11 Han Gao, Xu Han, Jiaoyang Huang, Jian-Xun Wang, Li-Ping Liu

Interpreting the inner function of neural networks is crucial for the trustworthy development and deployment of these black-box models. Prior interpretability methods focus on correlation-based measures to attribute model decisions to…

Machine Learning · Computer Science 2023-06-21 Ola Ahmad, Nicolas Bereux, Loïc Baret, Vahid Hashemi, Freddy Lecue

Large Language Models such as GPTs (Generative Pre-trained Transformers) exhibit remarkable capabilities across a broad spectrum of applications. Nevertheless, due to their intrinsic complexity, these models present substantial challenges…

Machine Learning · Computer Science 2024-10-17 Ashkan Golgoon, Khashayar Filom, Arjun Ravi Kannan

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

One significant challenge of exploiting graph neural networks (GNNs) in real-life scenarios is that they are always treated as black boxes, therefore leading to the requirement of interpretability. To address this, model-level…

Machine Learning · Computer Science 2025-09-22 Xiao Yue, Guangzhi Qu, Lige Gan

Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse,…

Transparency of neural networks' internal reasoning is at the heart of interpretability research, adding to trust, safety, and understanding of these models. The field of mechanistic interpretability has recently focused on studying…

Artificial Intelligence · Computer Science 2026-04-17 Nina Żukowska, Wolfgang Stammer, Bernt Schiele, Jonas Fischer

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation…

Machine Learning · Computer Science 2023-11-21 Aaquib Syed, Can Rager, Arthur Conmy

Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits…

Machine Learning · Computer Science 2024-12-06 Jatin Nainani, Sankaran Vaidyanathan, AJ Yeung, Kartik Gupta, David Jensen