Related papers: Position-aware Automatic Circuit Discovery

200 papers

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of…

Machine Learning · Computer Science 2025-03-28 Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are…

Artificial Intelligence · Computer Science 2026-05-12 Daking Rai, Mor Geva, Ziyu Yao

Sparse dictionary learning has been a rapidly growing technique in mechanistic interpretability to attack superposition and extract more human-understandable features from model activations. We ask a further question based on the extracted…

Machine Learning · Computer Science 2024-02-20 Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu

To date, most discoveries of network subcomponents that implement human-interpretable computations in deep vision models have involved close study of single units and large amounts of human labor. We explore scalable methods for extracting…

Computer Vision and Pattern Recognition · Computer Science 2024-04-23 Achyuta Rajaram, Neil Chowdhury, Antonio Torralba, Jacob Andreas, Sarah Schwettmann

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of…

Computation and Language · Computer Science 2026-05-12 Michael Li, Nishant Subramani

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational…

Computation and Language · Computer Science 2024-05-22 Charles O'Neill, Thang Bui

Circuit analysis of any certain model behavior is a central task in mechanistic interpretability. We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted…

Machine Learning · Computer Science 2024-07-23 Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu

Automated circuit discovery is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they…

Machine Learning · Computer Science 2026-02-20 Itamar Hadad, Guy Katz, Shahaf Bassan

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation…

Machine Learning · Computer Science 2023-11-21 Aaquib Syed, Can Rager, Arthur Conmy

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in…

Machine Learning · Computer Science 2025-06-24 Philipp Mondorf, Sondre Wold, Barbara Plank

Circuit graph discovery has emerged as a fundamental approach to elucidating the skill mechanistic of language models. Despite the output faithfulness of circuit graphs, they suffer from atomic ablation, which causes the loss of causal…

Computation and Language · Computer Science 2025-11-11 Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a…

Machine Learning · Computer Science 2023-10-31 Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in…

Computation and Language · Computer Science 2025-02-13 Michael A. Lepori, Thomas Serre, Ellie Pavlick

Anomaly detection is a well-established field in machine learning, identifying observations that deviate from typical patterns. The principles of anomaly detection could enhance our understanding of how biological systems recognize and…

Populations and Evolution · Quantitative Biology 2025-10-30 Steven A. Frank

The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be…

Machine Learning · Computer Science 2025-09-24 Yueyan Li, Wenhao Gao, Caixia Yuan, Xiaojie Wang

We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that…

Computation and Language · Computer Science 2025-11-21 Andrew Gomes

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific…

Artificial Intelligence · Computer Science 2025-09-30 Tung-Yu Wu, Fazl Barez

In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations…

Computation and Language · Computer Science 2026-04-10 Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or…

Artificial Intelligence · Computer Science 2025-03-04 Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu

Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these crucial subnetworks remains opaque. To gain an understanding…

Computation and Language · Computer Science 2025-07-17 Huaizhi Ge, Frank Rudzicz, Zining Zhu