Related papers: Position-aware Automatic Circuit Discovery

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of…

Machine Learning · Computer Science 2025-03-28 Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

Data-driven Circuit Discovery for Interpretability of Language Models

Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are…

Artificial Intelligence · Computer Science 2026-05-12 Daking Rai, Mor Geva, Ziyu Yao

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

Automatic Discovery of Visual Circuits

To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting…

Computer Vision and Pattern Recognition · Computer Science 2024-04-23 Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of…

Computation and Language · Computer Science 2026-05-12 Michael Li, Nishant Subramani

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational…

Computation and Language · Computer Science 2024-05-22 Charles O'Neill, Thang Bui

Automatically Identifying Local and Global Circuits with Linear Computation Graphs

Circuit analysis of any certain model behavior is a central task in mechanistic interpretability. We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted…

Machine Learning · Computer Science 2024-07-23 Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu

Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

*Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they…

Machine Learning · Computer Science 2026-02-20 Itamar Hadad, Guy Katz, Shahaf Bassan

Attribution Patching Outperforms Automated Circuit Discovery

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation…

Machine Learning · Computer Science 2023-11-21 Aaquib Syed, Can Rager, Arthur Conmy

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in…

Machine Learning · Computer Science 2025-06-24 Philipp Mondorf, Sondre Wold, Barbara Plank

Skill Path: Unveiling Language Skills from Circuit Graphs

Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanistic of language models. Despite the output faithfulness of circuit graphs, they suffer from atomic ablation, which causes the loss of causal…

Computation and Language · Computer Science 2025-11-11 Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Uncovering Intermediate Variables in Transformers using Circuit Probing

Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in…

Computation and Language · Computer Science 2025-02-13 Michael A. Lepori, Thomas Serre, Ellie Pavlick

Circuit design in biology and machine learning. II. Anomaly detection

Anomaly detection is a well-established field in machine learning, identifying observations that deviate from typical patterns. The principles of anomaly detection could enhance our understanding of how biological systems recognize and…

Populations and Evolution · Quantitative Biology 2025-10-30 Steven A. Frank

Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics

The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be…

Machine Learning · Computer Science 2025-09-24 Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang

Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that…

Computation and Language · Computer Science 2025-11-21 Andrew Gomes

Query Circuits: Explaining How Language Models Answer User Prompts

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific…

Artificial Intelligence · Computer Science 2025-09-30 Tung-Yu Wu, Fazl Barez

ADAG: Automatically Describing Attribution Graphs

In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations…

Computation and Language · Computer Science 2026-04-10 Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or…

Artificial Intelligence · Computer Science 2025-03-04 Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu

Understanding Language Model Circuits through Knowledge Editing

Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these crucial subnetworks remains opaque. To gain an understanding…

Computation and Language · Computer Science 2025-07-17 Huaizhi Ge, Frank Rudzicz, Zining Zhu