Related papers: Weight-sparse transformers have interpretable circ…

Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Siyu Zhang

Transcoders Find Interpretable LLM Feature Circuits

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language…

Machine Learning · Computer Science 2024-11-08 Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Learning Transformer Programs

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short…

Machine Learning · Computer Science 2023-11-01 Dan Friedman, Alexander Wettig, Danqi Chen

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational…

Computation and Language · Computer Science 2024-05-22 Charles O'Neill, Thang Bui

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of…

Machine Learning · Computer Science 2025-03-28 Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

Partially Rewriting a Transformer in Natural Language

The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to…

Machine Learning · Computer Science 2025-02-03 Gonçalo Paulo, Nora Belrose

Circuit Insights: Towards Interpretability Beyond Activations

The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on…

Machine Learning · Computer Science 2026-03-05 Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

Word Equations: Inherently Interpretable Sparse Word Embeddings through Sparse Coding

Word embeddings are a powerful natural language processing technique, but they are extremely difficult to interpret. To enable interpretable NLP models, we create vectors where each dimension is inherently interpretable. By inherently…

Computation and Language · Computer Science 2021-09-29 Adly Templeton

Language Model Circuits Are Sparse in the Neuron Basis

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse…

Computation and Language · Computer Science 2026-02-02 Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning

Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non-linear interactions and high-dimensional operations. While previous studies have…

Artificial Intelligence · Computer Science 2025-02-17 Lin Zhang, Lijie Hu, Di Wang

Multi-Scale Grouped Prototypes for Interpretable Semantic Segmentation

Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity…

Computer Vision and Pattern Recognition · Computer Science 2025-04-29 Hugo Porta, Emanuele Dalsasso, Diego Marcos, Devis Tuia

Interpretable Neural Embeddings with Sparse Self-Representation

Interpretability benefits the theoretical understanding of representations. Existing word embeddings are generally dense representations. Hence, the meaning of latent dimensions is difficult to interpret. This makes word embeddings like a…

Computation and Language · Computer Science 2023-06-27 Minxue Xia, Hao Zhu

Discursive Circuits: How Do Language Models Understand Discourse Relations?

Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler…

Computation and Language · Computer Science 2025-10-14 Yisong Miao, Min-Yen Kan

Neural Generators of Sparse Local Linear Models for Achieving both Accuracy and Interpretability

For reliability, it is important that the predictions made by machine learning methods are interpretable by human. In general, deep neural networks (DNNs) can provide accurate predictions, although it is difficult to interpret why such…

Machine Learning · Computer Science 2021-12-16 Yuya Yoshikawa, Tomoharu Iwata

Learning and Evaluating Sparse Interpretable Sentence Embeddings

Previous research on word embeddings has shown that sparse representations, which can be either learned on top of existing dense embeddings or obtained through model constraints during training time, have the benefit of increased…

Computation and Language · Computer Science 2018-09-26 Valentin Trifonov, Octavian-Eugen Ganea, Anna Potapenko, Thomas Hofmann

The Price of Interpretability

When quantitative models are used to support decision-making on complex and important topics, understanding a model's "reasoning" can increase trust in its predictions, expose hidden biases, or reduce vulnerability to adversarial attacks.…

Machine Learning · Computer Science 2019-07-09 Dimitris Bertsimas, Arthur Delarue, Patrick Jaillet, Sebastien Martin

From Neurons to Neutrons: A Case Study in Interpretability

Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of…

Machine Learning · Computer Science 2024-05-28 Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams

Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Roland S. Zimmermann, Thomas Klein, Wieland Brendel

The Mythos of Model Interpretability

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet…

Machine Learning · Computer Science 2017-03-07 Zachary C. Lipton

From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence

We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the…

Artificial Intelligence · Computer Science 2021-09-21 David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé, Hanna Wallach, Jennifer Wortman Vaughan